Assembled exon data for Gobioid phylogenetic study

Data files

Jul 22, 2025 version; files: 57.01 MB

Abstract

This dataset contains sequence data for hundreds of protein-coding loci used to build a high-resolution phylogeny of Gobiidae that confidently resolved the interrelationships of the major lineages within the family. It is the combined gobioid dataset: the assembled coding sequences of the exons, trimmed of their adaptors, in FASTA format. The flanking sequences of the exons are not included. These unaligned data were further processed and filtered at several levels of stringency before being used to infer phylogenies.

Title: Assembled Exon data for Gobioid Phylogenetic Study

https://doi.org/10.5061/dryad.s1rn8pkb8

Authors: Kendall Johnson*, Luke Tornabene, Chenhong Li, Lukas Rüber, Ulrich Schliewen, Derek Hogan, Frank Pezold

*Principal Investigator Contact Information
Name: Kendall Johnson
Institution: Department of Life Sciences, Texas A&M University – Corpus Christi
Email: kjohnson47@islander.tamucc.edu, kejohnson777@verizon.net

Executive Summary:

This dataset contains sequence data for protein-coding loci used to build a high-resolution phylogeny of Gobiidae that confidently resolved the interrelationships of the major lineages within the family. It is the combined gobioid dataset: the assembled coding sequences of the exons, trimmed of their adaptors, in FASTA format. The flanking sequences of the exons are not included. These unaligned data were further processed and filtered at several levels of stringency before being used to infer phylogenies.

Methods:

These data were obtained using the exon-capture method of Li et al. (2013). RNA baits were designed using EvolMarkers to target approximately 12,000 exons present in eight model fish genomes. Samples used for gene capture came from collections in the Fish Systematics and Conservation Lab at Texas A&M University – Corpus Christi or from collaborators at other institutions. The dataset comprises 170 samples representing 158 species and 130 genera. Samples were sequenced on an Illumina HiSeq 2500 or an Illumina HiSeq 4000 using 150 bp paired-end sequencing. The raw gene-capture data were assembled into coding regions, without the flanking regions, using the Assexon pipeline (Yuan et al., 2019). The assembled sequence files are in FASTA format and are named according to the corresponding gene in the bait set.

List of files:

Aligned, 50%-filtered sequence data: concat.goby_50pct_JE2.nex.fas
Aligned, 75%-filtered sequence data: concat.goby_75pct_JE2.nex.fas
The 16,677 assembled loci of the original dataset: Merged_Assembled_Loci.zip
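
To check the contents of Merged_Assembled_Loci.zip without extracting it, the archive can be listed with Python's standard zipfile module. The sketch below is illustrative only: the internal layout of the archive is not described here, so the file suffixes and the member names in the demonstration are assumptions, and the function name is hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, List, Union


def list_fasta_members(source: Union[str, BinaryIO]) -> List[str]:
    """Return the sorted names of archive members with a FASTA-like suffix."""
    # Assumed suffixes; adjust them to whatever the archive actually contains.
    suffixes = (".fas", ".fasta", ".fa")
    with zipfile.ZipFile(source) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(suffixes)
        )


# Build a tiny in-memory archive with hypothetical member names,
# so the example runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("geneA.fas", ">s1\nATG\n")
    archive.writestr("geneB.fas", ">s1\nATGA\n")
    archive.writestr("notes.txt", "not a sequence file")

buffer.seek(0)
members = list_fasta_members(buffer)
print(len(members), members)
```

For the real archive, pass the path to the downloaded file instead of the in-memory buffer, e.g. list_fasta_members("Merged_Assembled_Loci.zip").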

Usage notes:

The FASTA (.fas) files are plain text. They can be opened in a text editor such as Notepad++ or imported into sequence-analysis software such as Geneious Prime.
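
Because the files are plain text, they can also be read with a short script. The following is a minimal sketch of a FASTA reader using only the Python standard library. The function name and the sample headers are hypothetical, and the header naming used in this dataset may differ.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines.

    If a header occurs more than once, the last record with that header wins.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record example; real headers in this dataset may differ.
example = """>sample_A
ATGGCC
TTAA
>sample_B
ATGNNN
""".splitlines()

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

To read one of the files listed above, open it and pass the file handle, e.g. with open("concat.goby_50pct_JE2.nex.fas") as fh: records = read_fasta(fh).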

Definitions:

"50%-filtered": filtered from the total original dataset (Merged_Assembled_Loci.zip) to include only genes found in at least 50% of the samples
"75%-filtered": filtered from the total original dataset (Merged_Assembled_Loci.zip) to include only genes found in at least 75% of the samples
exon: a DNA sequence that is retained in the final, mature messenger RNA transcript
fasta: a text-based format for representing nucleotide or amino acid sequences
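
The "50%-filtered" and "75%-filtered" definitions above amount to an occupancy threshold on each gene. The sketch below illustrates that rule on toy data. It is not the pipeline used to produce the files in this dataset, and the function name, locus names, and sample names are hypothetical. If the 170 samples reported under Methods are the denominator (an assumption), the thresholds work out to at least 85 samples for 50% and at least 128 samples for 75%.

```python
from typing import Dict, List, Set


def filter_by_occupancy(
    locus_samples: Dict[str, Set[str]],
    total_samples: int,
    min_fraction: float,
) -> List[str]:
    """Return loci found in at least min_fraction of total_samples, sorted by name."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    kept = []
    for locus, samples in locus_samples.items():
        # Occupancy = fraction of all samples in which this locus was recovered.
        if len(samples) / total_samples >= min_fraction:
            kept.append(locus)
    return sorted(kept)


# Hypothetical toy data: 4 samples, 3 loci.
toy = {
    "locus_1": {"s1", "s2", "s3", "s4"},
    "locus_2": {"s1", "s2"},
    "locus_3": {"s1"},
}

kept_50 = filter_by_occupancy(toy, total_samples=4, min_fraction=0.50)
kept_75 = filter_by_occupancy(toy, total_samples=4, min_fraction=0.75)
print(kept_50)
print(kept_75)
```

In practice, locus_samples could be built by reading each per-gene FASTA file and collecting the sample names from its headers.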
