Data from: Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence

Neme, Rafik; Tautz, Diethard

Published Feb 12, 2016 on Dryad. https://doi.org/10.5061/dryad.8jb83

Data files

Feb 12, 2016 version files 987.13 MB

200b_win.features.genomes.norm.ind.zip (169.18 MB)
pop.200b.windows.zip (267.87 MB)
rarefaction.200b.windows.zip (350.10 MB)
resample.200b.windows.zip (193.54 MB)
TaxonomicRestrictedExpressionWindows.zip (6.44 MB)

Abstract

Deep sequencing analyses have shown that a large fraction of genomes is transcribed, but the significance of this transcription is much debated. Here, we characterize the phylogenetic turnover of poly-adenylated transcripts in a comprehensive sampling of taxa of the mouse (genus Mus), spanning a phylogenetic distance of 10 Myr. Using deep RNA sequencing we find that at a given sequencing depth transcriptome coverage becomes saturated within a taxon, but keeps extending when compared between taxa, even at this very shallow phylogenetic level. Our data show a high turnover of transcriptional states between taxa and that no major transcript-free islands exist across evolutionary time. This suggests that the entire genome can be transcribed into poly-adenylated RNA when viewed at an evolutionary time scale. We conclude that any part of the non-coding genome can potentially become subject to evolutionary functionalization via de novo gene evolution within relatively short evolutionary time spans.

Genome coverage of the mm10 mouse reference genome in four closely related species

Genome coverage of the mm10 mouse reference genome, computed from genomic alignments of Apodemus uralensis, Mus mattheyi, Mus spicilegus, and Mus spretus. The first six fields give the genomic location and the following four give the coverage for each of these species, in the order listed. Features were generated with bedtools, converted into SAF format, and quantified from BAM alignments using featureCounts.

200b_win.features.genomes.norm.ind.zip
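
The window features described above can be illustrated with a minimal sketch. It tiles a chromosome into fixed-width windows and emits rows in the five-column SAF layout that featureCounts reads (GeneID, Chr, Start, End, Strand). The 200 bp width is taken from the file names in this submission; the window identifiers, the "+" strand, and the toy chromosome "chrT" are assumptions made for illustration and are not taken from the deposited files.

```python
SAF_HEADER = ("GeneID", "Chr", "Start", "End", "Strand")


def windows_to_saf(chrom_sizes, width=200, strand="+"):
    """Tile each chromosome into fixed-width windows and return SAF rows.

    chrom_sizes maps chromosome name to length in bp. Window identifiers
    and the strand value are hypothetical choices for this sketch.
    """
    rows = []
    for chrom, size in chrom_sizes.items():
        for index, start0 in enumerate(range(0, size, width), start=1):
            end0 = min(start0 + width, size)  # BED-style exclusive end
            # BED is 0-based and half-open; SAF is 1-based and inclusive,
            # so only the start coordinate shifts by one.
            rows.append((f"{chrom}_{index}", chrom, start0 + 1, end0, strand))
    return rows


# Toy chromosome of 450 bp: two full windows and one truncated window.
print("\t".join(SAF_HEADER))
for row in windows_to_saf({"chrT": 450}):
    print("\t".join(str(field) for field in row))
```

The final window is shortened at the chromosome end rather than dropped; whether the deposited files handle chromosome ends this way is not stated here.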

Transcriptome coverage of the mm10 mouse reference genome across 200bp windows in ten closely related taxa and three tissues.

Transcriptome coverage of the mm10 mouse reference genome, computed from transcriptome alignments of three populations of Mus musculus domesticus (AH from Iran, CB from Germany, MC from France), two populations of Mus musculus musculus (KH from Kazakhstan, WI from Austria), Mus musculus castaneus (TA), Mus spicilegus (SC), Mus spretus (SP), Mus mattheyi (MA) and Apodemus uralensis (AP). The first six fields give the genomic location and the following fields give the coverage for each of the transcriptomes of the taxa listed here. Brain samples (pbrain), liver samples (pliver) and testis samples (ptestis) correspond to sequencing of approximately one third of an Illumina HiSeq 2000 lane per taxon, while the additional brain samples (xbrain) were sequenced on a whole Illumina HiSeq 2000 lane per taxon. Features were generated with bedtools, converted into SAF format, and quantified from BAM alignments using featureCounts.

pop.200b.windows.zip
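
As a rough guide to working with this layout, the sketch below separates the six location fields from the per-sample values in each row. It assumes a tab-delimited file with a header line, which is not confirmed here; the column names in the toy example (including "AH_pbrain" and "SP_pbrain") are hypothetical and only mimic the taxon and tissue codes described above.

```python
import csv
import io

N_LOCATION_FIELDS = 6  # the first six fields hold the genomic location


def read_window_table(handle, delimiter="\t"):
    """Yield (location_fields, {sample: value}) for each window row.

    Assumes a header line and a tab delimiter; both are assumptions
    about the deposited files, not documented facts.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[N_LOCATION_FIELDS:]
    for fields in reader:
        location = fields[:N_LOCATION_FIELDS]
        values = [float(v) for v in fields[N_LOCATION_FIELDS:]]
        yield location, dict(zip(samples, values))


# Toy input with hypothetical column names.
example = (
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tAH_pbrain\tSP_pbrain\n"
    "w1\tchr1\t1\t200\t+\t200\t12\t0\n"
)
for location, counts in read_window_table(io.StringIO(example)):
    print(location, counts)
```

The files are large, so the function streams rows one at a time instead of loading the whole table into memory.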

rarefaction.200b.windows

Transcriptome coverage of the mm10 mouse reference genome, computed from transcriptome alignments of three populations of Mus musculus domesticus (AH from Iran, CB from Germany, MC from France), two populations of Mus musculus musculus (KH from Kazakhstan, WI from Austria), Mus musculus castaneus (TA), Mus spicilegus (SC), Mus spretus (SP), Mus mattheyi (MA) and Apodemus uralensis (AP). This is a summary over three tissues (brain, liver, testis) for each of the taxa, resampled to obtain coverage rarefaction estimates by taxon and by fraction of data sequenced. The numbers in the column labels indicate the percentage of data sampled, with "total" representing the maximum available sampling for each taxon. Each row of this file corresponds to the same row of the transcriptome file included in this submission, and the two files must be analyzed together to obtain genomic position information. Features were generated with bedtools, converted into SAF format, and quantified from BAM alignments using featureCounts.
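
Since genomic positions have to be recovered from the transcriptome file, one way to combine the two files is to pair their rows by order, as in the sketch below. It works on already-parsed rows (lists of fields) and assumes that both files list the windows in the same order with the same number of rows; the function name and the toy values are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # marks the shorter input when row counts differ


def attach_positions(transcriptome_rows, rarefaction_rows, n_location=6):
    """Prefix each rarefaction row with the location fields of the
    transcriptome row at the same position in the file."""
    paired = zip_longest(transcriptome_rows, rarefaction_rows, fillvalue=_MISSING)
    for loc_row, rare_row in paired:
        if loc_row is _MISSING or rare_row is _MISSING:
            raise ValueError("the two files have different numbers of rows")
        yield tuple(loc_row[:n_location]) + tuple(rare_row)


# Toy rows: one window with a single sample column, and two rarefaction values.
transcriptome = [["w1", "chr1", "1", "200", "+", "200", "5"]]
rarefaction = [["10", "20"]]
print(list(attach_positions(transcriptome, rarefaction)))
```

Raising an error on a row-count mismatch guards against silently misaligned positions, for example when one file has a header line and the other does not.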

resampling_brain.200b.windows

Alignments from extensive sequencing of brain samples (~320 million reads) were split into three different sets of 100 million reads per taxon, so that each set contains independent observations. Pair relationships were maintained, so that both reads of a fragment are in the same set. Here we report the quantification per window of each of those resampled transcriptome sets, as coverage of the mm10 mouse reference genome. Coverage was computed from transcriptome alignments of three populations of Mus musculus domesticus (AH from Iran, CB from Germany, MC from France), two populations of Mus musculus musculus (KH from Kazakhstan, WI from Austria), Mus musculus castaneus (TA), Mus spicilegus (SC), Mus spretus (SP), Mus mattheyi (MA) and Apodemus uralensis (AP). The first six fields give the genomic location and the following fields give the coverage for each of the transcriptomes of the taxa listed here. Features were generated with bedtools, converted into SAF format, and quantified from BAM alignments using featureCounts.

resample.200b.windows.zip
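
The idea of splitting reads into independent sets while keeping mates together can be sketched as follows. This is not the procedure actually used for this dataset, which is not described in detail here; it is a toy illustration that assigns whole read pairs, identified by a shared read name, to disjoint sets of equal size and discards the remainder. The record layout (name, mate number), the function name, and the random seed are assumptions.

```python
import random


def split_pairs_into_sets(records, n_sets=3, pairs_per_set=2, seed=0):
    """Assign read pairs to disjoint sets, keeping mates together.

    records is a list of (read_name, mate_number) tuples. Pairs that do
    not fit into the n_sets * pairs_per_set quota are left out.
    """
    names = sorted({name for name, _mate in records})
    needed = n_sets * pairs_per_set
    if len(names) < needed:
        raise ValueError("not enough read pairs for the requested sets")
    random.Random(seed).shuffle(names)  # seeded for reproducibility
    set_of = {name: i // pairs_per_set for i, name in enumerate(names[:needed])}
    sets = [[] for _ in range(n_sets)]
    for name, mate in records:
        if name in set_of:  # both mates share a name, hence the same set
            sets[set_of[name]].append((name, mate))
    return sets


# Seven toy read pairs; six are distributed and one is left over.
toy = [(f"read{i}", mate) for i in range(7) for mate in (1, 2)]
for index, subset in enumerate(split_pairs_into_sets(toy)):
    print(index, sorted(subset))
```

Assigning by read name rather than by individual alignment is what keeps the two mates of a fragment in the same set.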

TaxonomicRestrictedExpressionWindows

Multiple files corresponding to windows with expression above 50 reads in one taxon and absent in all others. Most regions represent only a single taxon, with the exception of those defined for the Mus musculus musculus and Mus musculus domesticus populations, for which a window had to be present in at least one population and could also be present in more than one, provided it was absent in all other taxa. Taxon codes are as indicated in the main body of the manuscript. Tissue samples correspond to brain (B), liver (L), and testis (T), and to additional extensive sequencing of brain samples (UDS). Files are in bigWig format, for visualization together with the mm10 version of the mouse reference genome. We provide two IGV (Integrative Genomics Viewer) session XML files, which the user can load directly into the genome browser. One session uses local files, which have to be present in the same directory as the session file. The other uses existing files on our FTP server and does not require the local files, but does require an internet connection. In addition, we provide the expression values supporting the taxonomically-restricted status (*.dat), and a bed file of the relevant regions.
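
The selection rule described above can be written as a small predicate. The sketch below is one reading of that rule, not the code used to produce these files: it assumes that "absent" means zero reads, it treats the populations of Mus musculus domesticus (AH, CB, MC) and Mus musculus musculus (KH, WI) as groups, and it places no constraint on the other populations within the expressing group. The group labels "DOM" and "MUS" and the function name are hypothetical.

```python
# Populations that are pooled into one taxon; group labels are hypothetical.
GROUPS = {"AH": "DOM", "CB": "DOM", "MC": "DOM", "KH": "MUS", "WI": "MUS"}


def restricted_taxon(counts, threshold=50):
    """Return the taxon (or population group) a window is restricted to,
    or None if the window is not taxonomically restricted.

    counts maps taxon code to read count for one window and one tissue.
    """
    def group_of(taxon):
        return GROUPS.get(taxon, taxon)

    expressed = {group_of(t) for t, n in counts.items() if n > threshold}
    if len(expressed) != 1:
        return None  # expressed nowhere, or in more than one taxon
    (group,) = expressed
    # Assumption: "absent" means zero reads in every taxon outside the group.
    if any(n > 0 for t, n in counts.items() if group_of(t) != group):
        return None
    return group


print(restricted_taxon({"SP": 51, "SC": 0, "AH": 0}))   # restricted to SP
print(restricted_taxon({"AH": 60, "CB": 70, "SP": 0}))  # restricted to DOM
print(restricted_taxon({"AH": 60, "SP": 3}))            # not restricted
```

A stricter definition of absence, or a constraint on the non-expressing populations within a group, would change which windows qualify; the *.dat files provided with this submission give the expression values behind the actual calls.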