Transcriptome Shotgun Sequencing (RNA-seq) has been readily embraced by geneticists and molecular ecologists alike. As with all high-throughput technologies, it is critical to understand which analytic strategies are best suited and which parameters may bias the interpretation of the data. Here we use a comprehensive simulation approach to explore how various features of the transcriptome (complexity, degree of polymorphism π, alternative splicing), technological processing (sequencing error ε, library normalization) and bioinformatic workflow (de novo vs. mapping assembly, reference genome quality) impact transcriptome quality and inference of differential gene expression (DE). We find that transcriptome assembly and gene expression profiling (edgeR vs. baySeq software) works well even in the absence of a reference genome, and is robust across a broad range of parameters. We advise against library normalization, and in most situations advocate mapping assemblies to an annotated genome of a divergent sister clade, which generally outperformed de novo assembly (Trans-Abyss, Trinity, SOAPdenovo-Trans). Transcriptome complexity (size, paralogs, alternative splicing isoforms) negatively affected the assembly and DE profiling, whereas the effects of sequencing error and polymorphism were almost negligible. Finally, we highlight the challenge of gene name assignment for de novo assemblies, the importance of mapping strategies, and raise awareness of challenges associated with the quality of reference genomes. Overall, our results have significant practical and methodological implications, and can provide guidance in the design and analysis of RNA-seq experiments, particularly for organisms where genomic background information is lacking.

Polishing the reference genome

In the first step, Ns in the reference genome were substituted by one of the four bases randomly using perl's built in random number generator.

1_referencepolish.pl.pl

Simulate raw sequencing data - Step 1

In the second step, the reference genome is used along with expression profiles (exponential or uniform) to create fastq files using the program dwgsim version 0.1.2.

2a_simulate_reads_expression.pl

Simulate raw sequencing data - Step 2

The 2b_gnrCounts.R script generates per gene and per library expression levels for 2x10 libraries with differential expression among them. These read count files are used along with the fastq files simulated in previous step by 2c_drawreads.pl to create 2x10 libraries with differential expression among them.

2b_gnrCounts.R

Simulate raw sequencing data - Step 3

2c_drawreads.pl

De novo assembly

In the next step, De novo assemblies need to be performed as per the computational infrastructure available. We provide an example commandline used by us.

3_denovo.sh

Mapping assembly

The mapping assembly first requires mapping the reads onto the reference using stampy read mapper. The sam/bam file obtained after the mapping step is sorted and used to call the consensus sequence. Example commandline used by us can be modified suitably.

4_mapping.sh

Compare de novo assemblies

The quality of de novo assemblies can be evaluated using the scripts available on the GAGA website when alternative splicing is not being simulated. Step 5 consists of 7 sub-steps, each with its own script.

5_compare_denovo.sh

Compare de novo assemblies - Steps 1 to 7

5_substeps.zip

Compare mapping assemblies - Mapping 1

The quality of the mapping assemblies obtained after the consensus step is be evaluated.

6a_compare_mapping1.pl

Compare mapping assemblies - Mapping 2

The quality of the mapping assemblies obtained after the consensus step is be evaluated.

6b_compare_mapping2.pl

Differential expression analysis - baySeq

Differential expression analysis across the 2x10 libraries is performed using the Bioconductor package baySeq.

7a_bayseq.R

Differential expression analysis - edgeR

Differential expression analysis across the 2x10 libraries is performed using the Bioconductor packages edgeR

7b_edgeR.R

Evaluation of differential expression analysis

This script performs some evaluation of the output from the differential expression analyses, e.g. it generates FPR's and TPR's.

8_DEevaluation.R

Create data for zebra finch.

Steps 1 and 2 are accomplished by running 1_2_create_datasets.sh for the zebra finch dataset.

1_2_create_datasets.sh

ZebraFinchData

Zebra finch CDS downloaded from Biomart (Ensembl Version 61) with 100 bp 5' UTR and 400 bp 3' UTR per gene.

Distributions

Simuated read distributions per expression category.

distr.zip

Data from: Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments.

Data files

Abstract

Polishing the reference genome

Simulate raw sequencing data - Step 1

Simulate raw sequencing data - Step 2

Simulate raw sequencing data - Step 3

De novo assembly

Mapping assembly

Compare de novo assemblies

Compare de novo assemblies - Steps 1 to 7

Compare mapping assemblies - Mapping 1

Compare mapping assemblies - Mapping 2

Differential expression analysis - baySeq

Differential expression analysis - edgeR

Evaluation of differential expression analysis

Create data for zebra finch.

ZebraFinchData

Distributions

Data from: Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments.

Data files

Abstract

Usage notes

Polishing the reference genome

Simulate raw sequencing data - Step 1

Simulate raw sequencing data - Step 2

Simulate raw sequencing data - Step 3

De novo assembly

Mapping assembly

Compare de novo assemblies

Compare de novo assemblies - Steps 1 to 7

Compare mapping assemblies - Mapping 1

Compare mapping assemblies - Mapping 2

Differential expression analysis - baySeq

Differential expression analysis - edgeR

Evaluation of differential expression analysis

Create data for zebra finch.

ZebraFinchData

Distributions

Works referencing this dataset