Assembled exon data for Gobioid phylogenetic study
Data files
Jul 22, 2025 version files 57.01 MB
concat.goby_50pct_JE2.nex.fas: 28.64 MB
concat.goby_75pct_JE2.nex.fas: 3.63 MB
Merged_Assembled_Loci.zip: 24.74 MB
README.md: 2.84 KB
Abstract
This dataset contains sequence data for hundreds of protein-coding loci used to build a high-resolution phylogeny of Gobiidae, which confidently resolved the interrelationships of the major lineages within the family. It is the combined gobioid dataset: the assembled coding sequences of the exons, trimmed of their adaptors, in fasta format. The flanking sequences of the exons are not included. This unaligned data was further processed and filtered at several levels of stringency before being used to build phylogenies.
Title: Assembled Exon data for Gobioid Phylogenetic Study
https://doi.org/10.5061/dryad.s1rn8pkb8
Authors: Kendall Johnson*, Luke Tornabene, Chenhong Li, Lukas Rüber, Ulrich Schliewen, Derek Hogan, Frank Pezold
*Principal Investigator Contact Information
Name: Kendall Johnson
Institution: Department of Life Sciences, Texas A&M University – Corpus Christi
Email: kjohnson47@islander.tamucc.edu, kejohnson777@verizon.net
Executive Summary:
This dataset contains sequence data for protein-coding loci used to build a high-resolution phylogeny of Gobiidae, which confidently resolved the interrelationships of the major lineages within the family. It is the combined gobioid dataset: the assembled coding sequences of the exons, trimmed of their adaptors, in fasta format. The flanking sequences of the exons are not included. This unaligned data was further processed and filtered at several levels of stringency before being used to build phylogenies.
Methods:
This data was obtained using the exon-capture method of Li et al. (2013). RNA baits were designed using EvolMarkers to target approximately 12,000 exons present in eight model fish genomes. The samples used for gene capture were obtained from collections available in the Fish Systematics and Conservation Lab at Texas A&M University – Corpus Christi or from collaborators at other institutions. This dataset includes 170 samples representing 158 species and 130 genera. Samples were sequenced on an Illumina HiSeq 2500 or an Illumina HiSeq 4000 using 150 bp paired-end sequencing. The Assexon pipeline (Yuan et al., 2019) was used to assemble the raw gene-capture data into coding regions without the flanking regions. The assembled sequence files are in fasta format and are named according to the corresponding gene in the bait set.
List of files:
The aligned, 50%-filtered sequence file data: concat.goby_50pct_JE2.nex.fas
The aligned, 75%-filtered sequence file data: concat.goby_75pct_JE2.nex.fas
The 16,677 assembled loci of the original dataset: Merged_Assembled_Loci.zip
Usage notes:
These fasta (.fas) files can be opened by text-editing software such as Notepad++, or imported into software such as Geneious Prime.
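Because fasta is plain text, the files can also be read programmatically. The following is a minimal sketch in Python, assuming the standard fasta layout (a header line starting with ">" followed by one or more sequence lines); the function name `parse_fasta` is hypothetical, and the exact header format used in these files is not described here.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from an iterable of fasta-formatted lines.

    Assumes the standard layout: a '>' header line, then sequence lines.
    Sequence lines that wrap across several lines are joined together.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)  # close previous record
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)  # close the final record
    return records


# Example usage (file name taken from the list of files above):
# with open("concat.goby_75pct_JE2.nex.fas") as handle:
#     records = parse_fasta(handle)
#     print(len(records), "sequences read")
```

For large files or downstream analysis, an established library such as Biopython may be a better choice than a hand-written parser.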
Definitions:
"50%-filtered": filtered from the total original dataset (Merged_Assembled_Loci.zip) to include only genes found in at least 50% of the samples
"75%-filtered": filtered from the total original dataset (Merged_Assembled_Loci.zip) to include only genes found in at least 75% of the samples
exon: a sequence of DNA that is present in the final, mature messenger RNA transcript
fasta: a text-based format for representing nucleotide sequences
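The "50%-filtered" and "75%-filtered" definitions above can be illustrated with a short Python sketch. This is not the authors' filtering procedure, which is not described here; it only shows the idea of keeping loci found in at least a given fraction of samples. The function name, the input mapping, and the toy locus and sample names are all hypothetical.

```python
from typing import Dict, List, Set


def filter_loci_by_occupancy(
    locus_samples: Dict[str, Set[str]],
    n_samples: int,
    min_fraction: float,
) -> List[str]:
    """Return the loci present in at least `min_fraction` of all samples.

    locus_samples maps a locus name to the set of samples that have it.
    n_samples is the total number of samples in the dataset.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sorted(
        locus
        for locus, samples in locus_samples.items()
        if len(samples) / n_samples >= min_fraction
    )


# Toy example with four samples and three loci (made-up names).
toy = {
    "locusA": {"s1", "s2", "s3", "s4"},  # present in 100% of samples
    "locusB": {"s1", "s2"},              # present in 50% of samples
    "locusC": {"s1"},                    # present in 25% of samples
}
loci_50 = filter_loci_by_occupancy(toy, n_samples=4, min_fraction=0.50)
loci_75 = filter_loci_by_occupancy(toy, n_samples=4, min_fraction=0.75)
```

In the toy example, the 50% threshold keeps locusA and locusB, while the 75% threshold keeps only locusA.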
The two files "concat.goby_50pct_JE2.nex.fas" and "concat.goby_75pct_JE2.nex.fas" were further processed by reducing the taxon set to the 102 samples used in the final phylogenies for the submitted manuscript. These are the aligned sequence files used to build the phylogenies. "concat.goby_50pct_JE2.nex.fas" contains only genes present in at least 50% of the samples, whereas "concat.goby_75pct_JE2.nex.fas" contains only genes present in at least 75% of the samples.
Merged_Assembled_Loci.zip contains fasta-formatted text files that can be opened in any text-editing program. The .nex.fas files are also fasta-formatted text files and can be opened the same way.
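The zipped loci can also be inspected without extracting the archive. The sketch below uses Python's standard `zipfile` module to count the fasta records in each archive member. It assumes every member is a fasta text file; the internal folder layout of the archive is not described here, and the function name is hypothetical.

```python
import zipfile
from typing import IO, Dict, Union


def count_sequences_per_file(archive: Union[str, IO[bytes]]) -> Dict[str, int]:
    """Return {member name: number of '>' header lines} for a zip archive.

    `archive` may be a path or a binary file-like object. Directory
    entries are skipped; members without any '>' lines get a count of 0.
    """
    counts: Dict[str, int] = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as handle:  # members are read as bytes
                counts[info.filename] = sum(
                    1 for line in handle if line.startswith(b">")
                )
    return counts


# Example usage (file name taken from the list of files above):
# counts = count_sequences_per_file("Merged_Assembled_Loci.zip")
# print(len(counts), "files in archive")
```

Each count is the number of sequences in that locus file, which gives a quick view of how many samples were recovered per locus.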