Dryad

Error, noise and bias in de novo transcriptome assemblies

Cite this dataset

Freedman, Adam; Clamp, Michele; Sackton, Timothy (2020). Error, noise and bias in de novo transcriptome assemblies [Dataset]. Dryad. https://doi.org/10.5061/dryad.mcvdncjx9

Abstract

De novo transcriptome assembly is a powerful tool, widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30% to 83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent underestimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map-to-reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein-coding genes, including many that are highly expressed. We demonstrate a computational method, length-rescaled CPM, to partly alleviate noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.

Methods

Publicly available paired-end RNA-seq data were used to perform de novo transcriptome assemblies with the Trinity, Shannon, and BinPacker assemblers. Gene expression estimates were generated in various ways with RSEM and kallisto. SuperTranscripts were constructed from the assemblies, and genotypes were called against these SuperTranscripts with GATK. To benchmark these genotypes, we defined "true" genotypes as those generated by aligning RNA-seq reads directly to the Mus genome. For details concerning analysis pipelines, please see the methods section of the manuscript, related supplementary information, and the GitHub repository documenting methods and command lines in detail: https://github.com/harvardinformatics/TranscriptomeAssemblyEvaluation .
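
As a rough illustration of the benchmarking idea described above (comparing assembly-based genotype calls against "true" map-to-reference genotypes), the following minimal Python sketch computes a genotyping error rate over sites called in both sets. It is not the authors' pipeline, which is documented in the GitHub repository. The function names, the `(chromosome, position)` site key, and the toy genotypes are all hypothetical, and the sketch assumes both call sets have already been placed in a shared coordinate system.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]       # (chromosome, 1-based position); hypothetical key
Genotype = Tuple[str, str]   # pair of alleles, e.g. ("A", "G")


def normalize(gt: Genotype) -> Genotype:
    """Sort alleles so that ("G", "A") and ("A", "G") compare equal."""
    return tuple(sorted(gt))


def genotyping_error_rate(truth: Dict[Site, Genotype],
                          called: Dict[Site, Genotype]) -> float:
    """Fraction of sites genotyped in both sets whose genotypes disagree."""
    shared = truth.keys() & called.keys()
    if not shared:
        raise ValueError("no shared sites to compare")
    mismatches = sum(normalize(truth[s]) != normalize(called[s]) for s in shared)
    return mismatches / len(shared)


# Toy data (invented for illustration only, not taken from the dataset).
truth = {("chr1", 100): ("A", "G"), ("chr1", 250): ("C", "C"),
         ("chr2", 40): ("T", "T"), ("chr2", 90): ("G", "A")}
called = {("chr1", 100): ("A", "A"), ("chr1", 250): ("C", "C"),
          ("chr2", 40): ("T", "T"), ("chr2", 90): ("A", "G")}

# One of the four shared sites disagrees: a true heterozygote called homozygous.
error_rate = genotyping_error_rate(truth, called)
print(f"genotyping error rate: {error_rate:.2f}")
```

In this toy case the single discordant site is a true heterozygote called as a homozygote, the kind of error that would lower apparent heterozygosity.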

Usage notes

For details concerning analysis pipelines and specifics regarding data processing, please see the methods section of the manuscript, related supplementary information, and the GitHub repository documenting methods and command lines in detail: https://github.com/harvardinformatics/TranscriptomeAssemblyEvaluation .