Data from: De novo transcriptome assembly databases in the butterfly orchid Phalaenopsis equestris

Name: De novo transcriptome assembly databases in the butterfly orchid Phalaenopsis equestris
Keywords: Phalaenopsis equestris

Niu, Shan-Ce; Xu, Qing; Zhang, Guo-Qiang; Zhang, Yong-Qiang; Tsai, Wen-Chieh; Hsu, Jui-Ling; Liang, Chieh-Kai; Luo, Yi-Bo; Liu, Zhong-Jian

Published Sep 14, 2017 on Dryad. https://doi.org/10.5061/dryad.8253q

Data files

Sep 14, 2017 version files (1.26 GB total)

annotation_dataset4.tar.gz: 258.73 MB
CEGs_dataset6.tar: 675.84 KB
HSP_dataset5.tar: 368.64 KB
pequ_functional_annotation_dataset1.tar: 68.26 MB
pequ_gene_models_dataset1.tar: 52.94 MB
pequ_repeat_dataset1.tar.gz: 187.91 MB
Pha_1213.scafSeq.FG2_superscaffold.tar.gz: 306.45 MB
unigene_dataset3.tar: 383.47 MB
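
The archives listed above are a mix of plain .tar and gzip-compressed .tar.gz files. As a minimal sketch, assuming the files have been downloaded to a local directory, Python's standard tarfile module can inspect and unpack both kinds. The function names and the destination directory below are illustrative only, and the internal layout of each archive is not described on this page.

```python
import tarfile
from pathlib import Path


def list_members(archive_path):
    """Return the names of the regular files inside a .tar or .tar.gz archive."""
    # Mode "r:*" lets tarfile detect gzip compression automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def extract_archive(archive_path, dest_dir):
    """Unpack an archive into dest_dir (created if missing) and return the path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return dest


# Example call (hypothetical local paths):
# print(list_members("unigene_dataset3.tar"))
# extract_archive("unigene_dataset3.tar", "unigene_dataset3")
```

Listing the members first is a cheap way to see which file names an archive actually contains before unpacking several hundred megabytes.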

Abstract

Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search.

P. equestris genome assembly

The P. equestris genome scaffolds, together with a file describing where each scaffold or contig is located within the superscaffolds.

Pha_1213.scafSeq.FG2_superscaffold.tar.gz

P. equestris genome repeat annotation

The P. equestris genome repeat annotation, which contains the repeat annotation files produced by proteinmasker, RepeatMasker and TRF, the corresponding GFF files for each of these three tools, the GFF file of the de novo repeat annotation, and an xlsx file with the repeat annotation statistics.

pequ_repeat_dataset1.tar.gz

P. equestris genome gene models

The P. equestris genome gene models dataset contains the predicted coding sequences, the predicted proteins and a GFF file.

pequ_gene_models_dataset1.tar
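
As a quick check on the GFF file in this archive, the features can be tallied by type. The sketch below assumes the file follows the usual nine-column, tab-separated GFF layout with `#` comment lines; the actual file name inside the archive is not stated here, so the path in the usage comment is hypothetical.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF records by the value of their third column (feature type)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment or directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF record has nine tab-separated columns; ignore anything shorter.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Example call (hypothetical file name):
# with open("pequ_gene_models.gff") as handle:
#     print(count_feature_types(handle))
```

The function accepts any iterable of lines, so it works on an open file handle as well as on a list of strings.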

P. equestris genome functional annotation

The P. equestris genome functional annotation dataset contains the BLAST results against the KEGG, InterPro, Swiss-Prot and TrEMBL databases.

pequ_functional_annotation_dataset1.tar

The transcriptome assembly

The dataset contains the unigenes, taken as the longest contig per transcript generated by Trinity. The fb.flower bud.Unigene.fa file contains unigenes from the flower buds of P. equestris, the L5.root.Unigene.fa file contains unigenes from the root, the L6.stem.Unigene.fa file contains unigenes from the stem, and the PHA.leaf.Unigene.fa file contains unigenes from the leaf. The 12_day.unigene.fasta, 7_day.unigene.fasta and 4_day.unigene.fasta files contain unigenes from seeds sampled 12, 7 and 4 days after sowing on 1/2 MS medium, respectively. The sepal.unigene.fasta, petal.unigene.fasta, lip.unigene.fasta and column.unigene.fasta files contain unigenes from the sepal, petal, lip and column, respectively.

unigene_dataset3.tar
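
The unigene files are FASTA files, so a few lines of code are enough to get basic statistics per tissue. This is a minimal sketch with hypothetical function names; it assumes standard FASTA (a `>` header line followed by one or more sequence lines) and makes no assumption about the header contents.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(records):
    """Return the number of sequences, their total length and the longest length."""
    lengths = [len(sequence) for _, sequence in records]
    return {
        "count": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
    }


# Example call (path inside the unpacked archive is an assumption):
# with open("L5.root.Unigene.fa") as handle:
#     print(summarize(read_fasta(handle)))
```

Because read_fasta is a generator, a file is streamed record by record and never held in memory as a whole.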

The transcriptome functional annotation

The dataset contains the functional annotation and the coding sequence annotation for the 11 tissues. There are five annotation files per tissue: three functional annotation files (the KEGG, COG and Nr database annotation files) and two structural annotation files (the cds and pep files). The cds and pep files are in FASTA format; each header line contains the unigene name of the predicted coding sequence, the locus and the coding direction.

annotation_dataset4.tar.gz

HSP gene families in the eleven transcriptomes

We tested full-length transcripts against the HSP90 and HSP70 gene families to examine the completeness of the data, comparing the transcriptomes of the 11 tissues with the P. equestris genome. PEQU denotes P. equestris; flower bud, root, stem and leaf are labelled fb, L5, L6 and PHA, respectively. 4_day_seed, 7_day_seed and 12_day_seed are seeds sampled 4, 7 and 12 days after sowing on 1/2 MS medium, respectively.

HSP_dataset5.tar

100 CEGs for checking transcript assembly completeness

The alignment results for 100 randomly selected conserved core eukaryotic genes (CEGs) among Arabidopsis thaliana, the P. equestris genome and the eleven transcriptomes, used to examine the completeness of the transcript assemblies. Of these, 82 CEG sequences (82%) were perfectly reconstructed, showing high consistency. Some sequences, such as the homologues of At2g36880.1 and At1g12840.1, suggest that part of the sequence is missing from the PEQU genome, and some transcriptome sequences, such as the homologues of At4g39280.1, should be merged.

CEGs_dataset6.tar