Taimen (Hucho taimen) is an ecologically and economically important species that is classified as vulnerable on the IUCN Red List of Threatened Species; however, limited genomic information is available for this species. In addition to its application in gene expression profiling, RNA-Seq is a useful tool for obtaining genetic information and developing genetic markers for non-model species. In this study, we performed a comprehensive RNA-Seq analysis of taimen. We obtained 157 M clean reads (14.7 Gb) and used them to de novo assemble a high-quality transcriptome with an N50 size of 1060 bp. In the assembly, 82% of the transcripts were annotated using several databases, and 14,666 of the transcripts contained a full open reading frame. The assembly covered 75% of the transcripts of Atlantic salmon and 57.3% of the protein-coding genes of rainbow trout. To investigate genome evolution, we performed a systematic comparative analysis across 11 teleosts, including 8 salmonids, and found 313 gene families unique to taimen. Using the Atlantic salmon and rainbow trout transcriptomes as the background, we identified 250 positively selected transcripts. The pathway enrichment analysis revealed a unique characteristic of taimen: it possesses more immune-related genes than Atlantic salmon and rainbow trout; moreover, some of these genes have undergone strong positive selection. We also developed a pipeline for identifying microsatellite marker genotypes in samples, and successfully identified 24 polymorphic microsatellite markers for taimen. These data and tools are useful for studying conservation genetics, phylogenetics and evolution among salmonids, and for selective breeding of the threatened taimen.
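For readers unfamiliar with the N50 statistic quoted in the abstract, the following is a minimal sketch of how it is commonly computed from a list of transcript lengths. The function name `n50` is illustrative and is not part of the deposited files; the definition used here (the length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total) is the usual convention and is assumed to match the one used for the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least
    half of the summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one sequence length")


if __name__ == "__main__":
    # Toy example with nine made-up lengths (not taken from the dataset).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```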

Trinotate annotation

The taimen transcriptome was annotated using Trinotate (https://trinotate.github.io/) following its documentation. The NR, UniProt/Swiss-Prot and Pfam databases were used.

Trinotate.tsv.zip

Interproscan annotation

InterProScan annotation of the taimen transcriptome. A total of 79,800 transcripts were annotated using InterProScan.

Interproscan.tsv.zip

Gene Ontology annotation

Sequences with significant hits in the UniProt or Pfam databases were assigned GO terms using the Trinotate package, and GO terms were also assigned using InterProScan. In total, 72,728 transcripts were assigned to 15,107 GO terms, including 10,185 biological process terms, 1,429 cellular component terms and 3,493 molecular function terms.

GO.txt.zip

KEGG annotation

A KEGG pathway analysis was performed using GhostKOALA. A total of 51,698 transcripts were assigned to 8,052 KEGG ortholog groups.

KEGG.txt.zip

eggNOG annotation

COG functional categories were annotated using eggNOG-mapper (Huerta-Cepas et al., 2017); 72,605 putative proteins were annotated.

eggNOG-mapper.tsv.zip

Assembly transcriptome

The transcriptome sequences were assembled using the Trinity package. Before assembly, low-quality reads were filtered from the raw reads using Trimmomatic with the parameters LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50. The clean reads from the two pooled libraries were merged and in silico normalized using the Trinity package with default parameters to reduce the running time and memory consumption. A k-mer size of 25 and a minimum k-mer depth of two were used for assembly with the Trinity package. The contigs produced by Trinity were then passed to the TGI Clustering Tool (version 2.1) to handle alternative splicing and redundant sequences. The raw RNA-Seq reads and assembled transcripts were deposited in the European Nucleotide Archive under the project ID PRJEB19675; the assembled transcripts have accession numbers HAGJ01000001 to HAGJ01190473.

transcriptome.embl.dat.zip
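The Trimmomatic parameters listed for the assembly (LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50) can be read as four successive steps applied to each read. The sketch below illustrates that logic on a list of Phred quality scores. It is a simplified approximation written for illustration, not Trimmomatic's own implementation, and the function name `trim_read` is hypothetical.

```python
def trim_read(quals, leading=20, trailing=20, window=4, window_q=20, min_len=50):
    """Return the (start, end) slice of a read to keep, or None if dropped.

    quals is a list of per-base Phred quality scores. The steps mirror the
    order LEADING, TRAILING, SLIDINGWINDOW, MINLEN (simplified).
    """
    start, end = 0, len(quals)

    # LEADING: remove bases below the threshold from the 5' end.
    while start < end and quals[start] < leading:
        start += 1

    # TRAILING: remove bases below the threshold from the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # SLIDINGWINDOW: scan 5' to 3' and cut at the start of the first
    # window whose mean quality falls below the threshold.
    i = start
    while i + window <= end:
        if sum(quals[i:i + window]) / window < window_q:
            end = i
            break
        i += 1

    # MINLEN: drop reads that are too short after trimming.
    if end - start < min_len:
        return None
    return start, end


if __name__ == "__main__":
    # Made-up quality profile: 3 poor bases, 60 good bases, 10 poor bases.
    print(trim_read([10] * 3 + [30] * 60 + [5] * 10))
```

In practice the filtering would be run with Trimmomatic itself; the sketch only shows what each parameter controls.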

SNP.vcf

Clean reads were first mapped to the transcripts using Bowtie2, and SNPs were then called using SAMtools. Raw SNPs were filtered using Vcftools (Danecek et al., 2011) with a minimum depth of 4 and a minimum quality of 20, and SNPs clustered within 50 bp were also removed. SNPs were annotated using SnpEff (http://snpeff.sourceforge.net/).
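The SNP filtering criteria can be sketched as follows. This is an illustration only, under several assumptions: the depth and quality values are interpreted as thresholds a SNP must meet to be retained, "within 50 bp" is taken to mean a distance of 50 bp or less on the same transcript, the cluster filter is applied after the depth and quality filter, and every SNP in a cluster is removed. The function name `filter_snps` and the tuple layout are hypothetical and do not reflect the Vcftools interface.

```python
from collections import defaultdict


def filter_snps(snps, min_depth=4, min_qual=20.0, cluster_bp=50):
    """Filter SNP records given as (transcript, position, quality, depth).

    Step 1 keeps records meeting the depth and quality thresholds.
    Step 2 removes every record that has a neighbour on the same
    transcript within cluster_bp bases (assumed interpretation).
    """
    passed = [s for s in snps if s[3] >= min_depth and s[2] >= min_qual]

    by_transcript = defaultdict(list)
    for record in passed:
        by_transcript[record[0]].append(record)

    kept = []
    for records in by_transcript.values():
        records.sort(key=lambda r: r[1])
        for i, record in enumerate(records):
            near_prev = i > 0 and record[1] - records[i - 1][1] <= cluster_bp
            near_next = (i + 1 < len(records)
                         and records[i + 1][1] - record[1] <= cluster_bp)
            if not (near_prev or near_next):
                kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up records, not taken from SNP.vcf.
    example = [
        ("t1", 100, 30.0, 10), ("t1", 120, 40.0, 8), ("t1", 300, 25.0, 5),
        ("t1", 500, 10.0, 9), ("t2", 50, 50.0, 3), ("t2", 400, 22.0, 4),
    ]
    for record in filter_snps(example):
        print(record)
```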

ORF prediction

TransDecoder (https://transdecoder.github.io/) was used to predict open reading frames (ORFs) and translate them into proteins, and homology searches against the Pfam and UniProt databases were performed as supporting evidence for the ORFs. ORFs with fewer than 30 amino acids were discarded.

TransDecoder.zip
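The length cut-off applied to the predicted ORFs can be illustrated with a small FASTA filter. This is a sketch under assumptions: the peptide file is assumed to be in plain FASTA format, a trailing `*` (stop codon symbol) is assumed not to count toward the 30 amino acids, and the function names `parse_fasta` and `filter_orfs` are hypothetical rather than part of TransDecoder.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first whitespace-delimited token as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def filter_orfs(records, min_aa=30):
    """Keep peptides with at least min_aa residues, ignoring a trailing '*'."""
    return [(n, s) for n, s in records if len(s.rstrip("*")) >= min_aa]


if __name__ == "__main__":
    # Two made-up peptides: 29 and 30 residues, each ending in a stop symbol.
    fasta = ">orf1 toy\n" + "M" * 29 + "*\n>orf2 toy\n" + "M" * 30 + "*\n"
    for name, seq in filter_orfs(parse_fasta(fasta)):
        print(name, len(seq.rstrip("*")))
```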

Microsatellite Primers

Sputnik software was used to search for di-, tri-, tetra-, penta- and hexa-nucleotide motif SSRs. Primers were designed using the Primer3 package. After the amplicon sequences were aligned to the Atlantic salmon genome with BLAT, primers were chosen if their amplicons were located in the Atlantic salmon genome with identities >70% and spanned distances close to the length of the amplicons.

SSR.primers.txt.zip
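To make the motif classes searched for concrete, the sketch below finds perfect di- to hexa-nucleotide repeats with a regular expression. It is not Sputnik's algorithm (Sputnik scores repeats and tolerates imperfections), and the minimum of five repeat units is an assumption chosen for illustration, as is the function name `find_perfect_ssrs`.

```python
import re


def find_perfect_ssrs(seq, min_repeats=5):
    """Find perfect SSRs with motif lengths 2 to 6.

    Returns a list of (start, end, motif, repeat_count) tuples using
    0-based, end-exclusive coordinates. Motifs that are themselves
    repeats of a shorter unit (e.g. AA or ATAT) are skipped so that each
    repeat is reported only under its shortest motif.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(2, 7):
        # One captured motif followed by at least (min_repeats - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            is_compound = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_compound:
                continue
            repeats = (match.end() - match.start()) // unit_len
            hits.append((match.start(), match.end(), motif, repeats))
    return hits


if __name__ == "__main__":
    # Made-up sequence containing (AT)6 and (CAG)5.
    toy = "GGC" + "AT" * 6 + "CCT" + "CAG" * 5 + "TT"
    for hit in find_perfect_ssrs(toy):
        print(hit)
```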

Sequences of index and primers

This pack contains four files: "forward_index.txt" and "reverse_index" contain the index sequences for demultiplexing reads to samples, "primers.txt" contains the primer sequences for assigning reads to loci, and "sample_config.txt" is the index configuration for the samples. These files were used to genotype 32 taimen samples collected from the Hutou section of the Wusuli River (133°40′17″ E, 45°58′50″ N). The raw reads, sequenced on the Illumina HiSeq2500 platform in 250 bp paired-end mode, were deposited in the European Nucleotide Archive under the project ID PRJEB19675 with accession number ERR2029723.

Index_and_primers_for_genotype.zip
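The roles of the index and primer files can be sketched as a two-stage lookup: a read pair is first assigned to a sample by its forward and reverse index combination, and the forward read is then assigned to a locus by its primer. The sketch below is hypothetical throughout. It assumes that indexes and primers sit at the start of the reads and match exactly, and it uses made-up sequences and names; it does not reproduce the file formats in this pack or the behaviour of the DeMultiIndex and SSRGeno programs.

```python
def match_prefix(read, named_seqs):
    """Return (name, length) of the first sequence that prefixes the read."""
    for name, seq in named_seqs.items():
        if read.startswith(seq):
            return name, len(seq)
    return None


def classify_pair(fwd_read, rev_read, fwd_indexes, rev_indexes,
                  sample_config, primers):
    """Assign a read pair to (sample, locus), or return None if unassigned.

    sample_config maps (forward index name, reverse index name) to a
    sample name; primers maps a locus name to its forward primer.
    """
    fwd_hit = match_prefix(fwd_read, fwd_indexes)
    rev_hit = match_prefix(rev_read, rev_indexes)
    if fwd_hit is None or rev_hit is None:
        return None
    sample = sample_config.get((fwd_hit[0], rev_hit[0]))
    if sample is None:
        return None
    # Strip the index, then look for the primer at the start of the insert.
    locus_hit = match_prefix(fwd_read[fwd_hit[1]:], primers)
    if locus_hit is None:
        return None
    return sample, locus_hit[0]


if __name__ == "__main__":
    # All sequences and names below are invented for illustration.
    fwd_indexes = {"F1": "ACGTAC", "F2": "TGCATG"}
    rev_indexes = {"R1": "GATTAC", "R2": "CCGGTT"}
    sample_config = {("F1", "R2"): "taimen_01"}
    primers = {"locus_A": "GGATCCAAGT", "locus_B": "TTGACCGTAA"}
    fwd = "ACGTAC" + "TTGACCGTAA" + "CACACACACA"
    rev = "CCGGTT" + "AAAATTTT"
    print(classify_pair(fwd, rev, fwd_indexes, rev_indexes,
                        sample_config, primers))
```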

Pipeline for characterizing polymorphism and defining genotype of microsatellite markers

This pack contains the DeMultiIndex and SSRGeno binary files (for 64-bit Linux and 64-bit macOS systems), an R script for drawing allele depth barplots, and a manual.

microsatellite_pipeline.zip

Data from: De novo assembly and characterization of the Hucho taimen transcriptome
