Data from: Contrasting distributions and expression characteristics of transcribing repeats in Setaria viridis
Data files
Jan 27, 2025 version files 31.25 GB
- allrepeats.fasta (6.86 MB)
- Annotation_LTR.gff3 (54.35 MB)
- README.md (4.91 KB)
- Setaria_viridis_Crown_bam.zip (1.30 GB)
- Setaria_viridis_Inflorescence_bam.zip (34.04 MB)
- Setaria_viridis_Leaf_bam.zip (27.57 GB)
- Setaria_viridis_Stem_bam.zip (2.28 GB)
- Table_S1.xlsx (11.08 KB)
Abstract
Repetitive DNA contributes significantly to plant genome size, adaptation, and evolution. However, little is understood about the transcription of repeats. This is addressed here in the plant green foxtail millet (Setaria viridis). First, we used RepeatExplorer2 to calculate the genome proportion (GP) of all repeat types and compared the GP of LTR retroelements against annotated complete and incomplete LTR retroelements (Ty1/copia, Ty3/gypsy) identified by DANTE in a whole genome assembly. We show that DANTE-identified LTR retroelements can comprise ~0.75% of the inflorescence poly-A transcriptome and ~0.24% of the stem ribo-depleted transcriptome. In the RNA libraries from inflorescence tissue, both LTR retroelements and DNA transposons identified by RepeatExplorer2 were highly abundant, where they may be taking advantage of the reduced epigenetic silencing in the germ line to amplify. Typically, there was a higher representation of DANTE-identified LTR retroelements in the transcriptome than RepeatExplorer2-identified LTR retroelements, potentially reflecting the transcription of elements that have insufficient genomic copy numbers to be detected by RepeatExplorer2. In contrast, for ribo-depleted libraries of stem tissues the reverse was observed, with a higher transcriptome representation of RepeatExplorer2-identified LTR retroelements. For RepeatExplorer2-identified repeats, we show that the GP of most Ty1/copia and Ty3/gypsy families was positively correlated with their transcript proportion. In addition, GC-rich repeats with high sequence similarity were also the most abundant in the transcriptome, and these likely represent young elements that are most capable of amplification due to their ability to evade epigenetic silencing.
README: Contrasting distributions and expression characteristics of transcribing repeats in Setaria viridis
https://doi.org/10.5061/dryad.rjdfn2znh
Contributors:
Ana Luiza Franco: Federal University of Juiz de Fora, Institute of Biological Sciences, Brazil.
Wenjia Gu: Queen Mary University of London, UK.
Petr Novák: Biology Centre, Czech Academy of Sciences, Czech Republic.
Ilia J. Leitch: Royal Botanic Gardens, Kew, UK.
Lyderson F. Viccini: Federal University of Juiz de Fora, Institute of Biological Sciences, Brazil.
Andrew R. Leitch: Queen Mary University of London, UK.
Description of the data and file structure
Repeat Transcription Analysis Using RepeatExplorer2
Objective
This study explores the feasibility of analyzing repeat transcription using genome skimming data, enabling application across organisms regardless of genome assembly availability.
RepeatExplorer2
- Builds a library of genomic repeats (e.g., LTR retroelements, DNA transposons, satellites).
- Identifies consensus repeat sequences but does not distinguish between complete and incomplete repeats.
- Defines repeats as sequences with ≥90% similarity spanning ≥55% of Illumina read lengths (Novák et al., 2020).
- Data source: Illumina short-read sequences for S. viridis (Mamidi et al., 2020).
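The similarity criterion listed above can be illustrated with a minimal Python sketch. It is not RepeatExplorer2 code: the function name and arguments are hypothetical, and it assumes that both reads have the same length, so that the overlap can be expressed as a fraction of a single read length.

```python
def passes_similarity_threshold(identity_pct, aligned_len, read_len,
                                min_identity=90.0, min_overlap=0.55):
    """Return True if a pairwise hit meets the >=90% / >=55% criterion.

    Hypothetical helper for illustration only; it assumes equal read
    lengths and is not taken from the RepeatExplorer2 source.
    """
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    overlap_fraction = aligned_len / read_len
    return identity_pct >= min_identity and overlap_fraction >= min_overlap


# Example with 151 bp reads, the length of the genomic reads used here
print(passes_similarity_threshold(92.5, 100, 151))  # 100/151 is about 0.66 -> True
print(passes_similarity_threshold(92.5, 70, 151))   # 70/151 is about 0.46 -> False
```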
Process
1. Repeat Identification:
- Detects repeats with ~200 or more copies.
- Summarizes repeat content in the S. viridis genome.
2. Mapping:
- RNA-seq reads from different tissues of S. viridis were mapped to repeat contigs (software: Bowtie2 and SAMtools).
- Comparisons made with DANTE-annotated LTR retroelements from assembled genomes.
3. Characterization:
- Evaluated repeat transcript proportions (TP) of different repeat classes.
Files and variables
File: allrepeats.fasta
Description: Contigs of repeats generated from raw reads of Setaria viridis using RepeatExplorer2
File: Annotation_LTR.gff3
Description: Annotation file generated using the DANTE software from the whole genome of Setaria viridis
File: Table_S1.xlsx
Description: NCBI accession numbers and sources of DNA and RNA sequence read datasets of Setaria viridis ‘A10’ used in this study. The original sources of the data are given in the column ‘Reference’. Descriptions of the library (taken from the source papers), the library layout (single- or paired-end reads), the library type (genomic, ribo-depleted or poly-A), the tissue source of the library and the numbers of reads analysed are shown.
File: Pipeline_TPG.pdf
Description: Pipelines used to quantify and characterize repeats in the genome and transcriptomes of Setaria viridis.
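Files: Setaria_viridis_Crown_bam.zip, Setaria_viridis_Inflorescence_bam.zip, Setaria_viridis_Leaf_bam.zip, Setaria_viridis_Stem_bam.zip
Description: BAM files containing the mapping outputs of the repeat contigs (from allrepeats.fasta) aligned against RNA-Seq data, with one archive per tissue (crown, inflorescence, leaf and stem).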
Code/software
extract_read_IDs_from_BLAST_report_for_hits_of_given_overlap_and_identity_v7
Software: blast+
Get_reads_similarity.sh
Creates a custom BLAST database from the reads file of a cluster and runs BLAST, searching the reads against themselves.
Total_Script_Mitocondria_reads.sh
- Select clusters that contain any percentage of mitochondrial or plastid sequences.
- Open the file "dna_database_annotation" in the RepeatExplorer output to identify reads originating from organelles.
- Locate these reads in the "reads.fasta" file and extract them.
Software: SAMtools and Bowtie2
Map_contigs_to_RNAseq_reads.sh
For each RNA-Seq library, Trimmomatic is used to perform trimming. Bowtie2 is then employed to map the repeat contigs against the RNA-Seq reads. From the resulting .bam files, only the mapped reads are filtered using SAMtools. Finally, the number of reads in each cluster previously annotated in RepeatExplorer is counted (file${TRANSLIB}.txt).
Software: Trimmomatic, FastQC, SAMtools and Bowtie2
Script_mapping_to_genome.sh
The RNA-Seq reads are mapped against the whole genome of Setaria viridis. HTSeq-count is then used to quantify the number of reads mapped to features in an annotation file previously generated by the DANTE software.
DESeq2_Repeats_2024
R script used to run DESeq2 for analyzing the expression of repeats detected by the DANTE software.
Total_Rscripts_TPG
R script used for generating graphs and linear models in this study
Access information
Other publicly accessible locations of the data:
- All scripts are stored on GitHub (https://github.com/ana-franco-bio/Repeats_transcriptome).
- All raw sequencing files generated in this study are available on NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1020582.
Methods
Datasets used in the analysis
The DNA sequences used to characterize genomic repeats with RepeatExplorer2 were downloaded from NCBI (See Table 1) and comprised paired-end Illumina NovaSeq 6000 sequence data with reads of 151 bp length from genomic DNA of Setaria viridis ‘A10’ (SRR10051273) (Mamidi et al., 2020).
For the RNA sequencing conducted here, Setaria viridis ‘A10’ seeds were washed with distilled water and sterilized by incubating in a solution of 20% sodium hypochlorite and 0.1% Tween 20. Seeds were germinated in Petri dishes on MS medium for 5 days, followed by 7 days in plastic pots. After root system establishment, the seedlings were transferred to pots containing a mixture of substrate and sand (1:1) and grown in a plant chamber at the Federal University of Juiz de Fora (Brazil), under a photoperiod of 16/8 h (light/dark) at 25 ± 2 °C for at least 20 days. RNA from c. 100 mg of leaf material was extracted with the RNeasy Plant Mini Kit (Qiagen, 74904).
Ribo-depleted RNA libraries were prepared for direct sequencing using Invitrogen’s RiboMinus™ Plant Kit for RNA-Seq (cat. A10830-08), which uses labelled oligos against ribosomal sequences to deplete unwanted ribosomal RNA (rRNA) transcripts. Isolated RNA was purified and concentrated using the Monarch® RNA Cleanup Kit. Poly-A RNA libraries were prepared for direct sequencing using New England Biolabs’ NEBNext® poly(A) mRNA magnetic isolation module (NEB E7490; 7 min. fragmentation time). Direct sequencing of the ribo-depleted and poly-A libraries, each with three biological replicates, was performed using Illumina NovaSeq 6000 (Genomic Centre, Queen Mary University of London), generating 150 bp paired-end reads.
Further RNA libraries were downloaded from NCBI sourced from two previously published experiments, each with three biological replicates (see Table S1 for details). These were: (i) paired-end Illumina HiSeq 2000 sequence data of ribo-depleted transcriptome libraries of S. viridis ‘A10’ with read lengths of 101 bp. We analysed the data generated from stem tissue (the region where new tillers are produced) and crown tissue (the region where new roots are produced), sampled nine days after sowing. The libraries and the experimental conditions are described in Sebastian et al. (2016); and (ii) single-end Illumina HiSeq 2500 sequence data from poly-A transcriptome libraries with read lengths of 100 bp prepared from inflorescence primordia of wild-type material sampled fifteen days after sowing. Details are described in Yang et al. (2018).
Analysis of repeats in the genome using RepeatExplorer2
All reads from the sequencing conducted here had adapter sequences removed using Trimmomatic v.0.39 before analysis. Sequence read quality control was evaluated using FastQC. Reads corresponding to 0.4% of the genome which had passed the quality control threshold (Phred score > 33) were trimmed to 151 bp in Trimmomatic v.0.39 (Bolger et al., 2014). Paired-end reads from Sebastian et al. (2016, 101 bp) and single-end reads from Yang et al. (2018, 100 bp) were used.
Repeats were characterized using the RepeatExplorer2 pipeline (Novák et al., 2010; Novák, Neumann, et al., 2020), implemented on the Galaxy server (https://repeatexplorer-elixir.cerit-sc.cz). Briefly, using an all-to-all BLAST analysis, RepeatExplorer2 (pipeline version: 0.3.8-451(9d65fb1)) clusters reads that are at least 90% similar over at least 55% of the sequence length to identify, quantify and de novo annotate repeats. In addition, based on protein domains (protein database Viridiplantae v2.2.fasta) and other DNA databases including those in RepeatExplorer2, we identified known TEs. TEs were classified according to the REXdb classification system (Neumann et al., 2019).
The genome proportion (GP) of each repeat cluster containing more than 150 reads (termed here ‘repeat top clusters’) was calculated as the total number of reads in the repeat cluster divided by the total number of reads analysed (excluding reads from clusters identified as being derived from mitochondria or plastids).
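The GP calculation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the script used in the study: the function name, the dictionary layout and the cluster labels are hypothetical, and GP is returned as a fraction (multiply by 100 for a percentage).

```python
def genome_proportions(cluster_read_counts, total_reads,
                       organelle_clusters=(), min_reads=150):
    """Compute GP for each 'repeat top cluster' (more than min_reads reads).

    cluster_read_counts: dict mapping a cluster label to its read count.
    organelle_clusters:  clusters identified as mitochondrial or plastid;
                         their reads are removed from the denominator.
    All names are hypothetical and chosen for illustration.
    """
    organelle = set(organelle_clusters)
    organelle_reads = sum(cluster_read_counts.get(c, 0) for c in organelle)
    denominator = total_reads - organelle_reads
    if denominator <= 0:
        raise ValueError("no reads left after excluding organelle clusters")

    gp = {}
    for cluster, n_reads in cluster_read_counts.items():
        if cluster in organelle:
            continue  # organelle clusters are not reported as repeats
        if n_reads > min_reads:
            gp[cluster] = n_reads / denominator
    return gp


# Toy example: CL3 is treated as an organelle cluster, CL4 is too small
counts = {"CL1": 5000, "CL2": 1200, "CL3": 800, "CL4": 100}
for cluster, value in genome_proportions(counts, 100000, ["CL3"]).items():
    print(cluster, round(100 * value, 3), "%")
```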
Characterizing sequence similarity and GC content of repeats in the genome
The reads of each cluster generated by RepeatExplorer2 were used in all-to-all BLAST searches to generate pairwise similarity scores. The frequency distribution of the pairwise similarity scores was used to estimate the repeat similarity score of each cluster (= modal value of the scores for reads with ≥80% similarity), using a custom Perl script. The Perl script was used to filter out self-hits and reciprocal hits from the BLAST results. The final modal value for each cluster and the linear models were calculated in RStudio <http://www.rstudio.com/> (R Team, 2020). The GC content of all reads in each of the repeat top clusters was estimated using an adapted Python script (Meneghin, 2009).
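As a rough illustration of these two measurements, the sketch below re-expresses the filtering and modal-value steps in Python. It is not the Perl or Python script used in the study. It assumes the BLAST hits have already been reduced to (query, subject, percent identity) tuples, that only the first hit per read pair is kept, and that scores are grouped into 1%-wide bins before taking the mode; the bin width is an assumption.

```python
from collections import Counter


def modal_similarity(hits, min_identity=80.0):
    """Modal percent identity of a cluster from all-to-all BLAST hits.

    hits: iterable of (query_id, subject_id, percent_identity).
    Self-hits and reciprocal hits are skipped; scores are floored into
    1%-wide bins (an assumption made for this sketch).
    """
    seen_pairs = set()
    bins = Counter()
    for query, subject, pident in hits:
        if query == subject:
            continue  # self-hit
        pair = frozenset((query, subject))
        if pair in seen_pairs:
            continue  # reciprocal or repeated hit for the same read pair
        seen_pairs.add(pair)
        if pident >= min_identity:
            bins[int(pident)] += 1
    if not bins:
        return None
    return bins.most_common(1)[0][0]


def gc_content(sequences):
    """GC percentage over all unambiguous bases in a set of reads."""
    gc = total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else None


hits = [("r1", "r1", 100.0), ("r1", "r2", 95.2), ("r2", "r1", 95.2),
        ("r1", "r3", 95.8), ("r2", "r3", 88.1), ("r3", "r4", 75.0)]
print(modal_similarity(hits))        # 95
print(gc_content(["ATGC", "GGCC"]))  # 75.0
```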
Pipeline for detecting expressed repeats in the transcriptome
The RNA-seq data (see Table S1) were evaluated using FastQC and those that passed the quality control threshold (Phred score > 33) were trimmed to 151 bp in Trimmomatic v.0.39 to remove Illumina adaptors (Bolger et al., 2014). The number of reads in each library which passed thresholds for quality after trimming is given in Table S1.
The pipeline used to identify and quantify reads in the transcriptome using RepeatExplorer2 is summarized in Figure S6. RepeatExplorer2 generates a library of contigs that are consensus sequences for the genomic repeats that comprise each repeat cluster. The files “contigs.fasta” for each cluster from RepeatExplorer2 were merged to produce a library of repeat contigs for mapping. This repeat contig library was used as the reference to map reads from each of the transcriptome libraries using Bowtie2 (Langmead & Salzberg, 2012). The threshold for mapping was 90% similarity over at least 55% of the sequence length. By default, Bowtie2 performs end-to-end read alignment. Samtools (Li et al., 2009) was used to save the output files (.bam and .sam format), and the total number of transcript reads which mapped once to any of the contigs in a cluster was calculated and used for further analysis. This research utilised Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT http://doi.org/10.5281/zenodo.43804.
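The per-cluster read counting described above can be sketched as follows. This is not the study's script, which uses Bowtie2 and SAMtools on the command line. The sketch parses SAM-formatted text directly, counts only primary mapped alignments so that each read is counted at most once, and assumes contig names of the form "CL<number>Contig<number>"; that naming pattern is an assumption and should be checked against allrepeats.fasta.

```python
import re
from collections import Counter

# Assumed contig naming, e.g. "CL12Contig3" belongs to cluster "CL12"
CONTIG_NAME = re.compile(r"^(CL\d+)Contig\d+")


def count_reads_per_cluster(sam_lines):
    """Count primary mapped alignments per repeat cluster from SAM text."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        match = CONTIG_NAME.match(fields[2])
        if match:
            counts[match.group(1)] += 1
    return counts


sam = [
    "@SQ\tSN:CL1Contig1\tLN:500",
    "read1\t0\tCL1Contig1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tCL1Contig2\t5\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read4\t256\tCL2Contig1\t7\t1\t4M\t*\t0\t0\tACGT\tIIII",
    "read5\t0\tCL2Contig1\t7\t42\t4M\t*\t0\t0\tACGT\tIIII",
]
print(dict(count_reads_per_cluster(sam)))  # {'CL1': 2, 'CL2': 1}
```

For the BAM files in this dataset, the same function could be fed the text output of `samtools view`.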
Calculating the transcript proportion of repeats in the transcriptome
To calculate the transcript proportion (TP) of each repeat, the transcript reads (RNA-seq) were mapped to the RepeatExplorer2 contigs derived from the DNA sequence reads (genome sequencing) and the mapped reads were counted, excluding any RNA reads that mapped to repeat clusters comprising ribosomal DNA (rDNA) or organelle sequences. Some repeat clusters contained a proportion of DNA sequences of organellar origin. Whilst some reads might represent examples of the natural integration of organellar DNA into the nuclear genome, we nevertheless removed them as they are not relevant to this analysis. Reads from some clusters that contained a proportion of tRNA sequences were also removed. The RNA-seq reads which mapped to repeat clusters were used to calculate the TP of repeats in the ribo-depleted and poly-A RNA libraries. Repeats were classified at the levels of repeat superfamily and individual repeat lineages within these superfamilies. Finally, the numbers of mapped RNA-seq reads for each repeat type (i.e. repeat superfamily, e.g. all Ty3/gypsy elements) or individual repeat lineage (e.g. Athila elements) were summed, divided by the total number of RNA-seq reads analysed in the library (excluding rRNA and organelle reads), and expressed as a percentage.
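A minimal Python sketch of the TP calculation follows. It is illustrative only: the function, the annotation dictionary and the cluster labels are hypothetical. Clusters that were excluded (rDNA, organelle or tRNA-containing) are simply left out of the annotation, and the number of reads mapping to rRNA and organelle sequences is passed in separately so that it can be removed from the denominator.

```python
from collections import Counter


def transcript_proportions(mapped_counts, total_library_reads,
                           annotation, excluded_reads=0):
    """TP (%) per repeat type for one RNA-seq library.

    mapped_counts:  dict of cluster label -> mapped RNA-seq read count.
    annotation:     dict of cluster label -> repeat type (superfamily or
                    lineage); excluded clusters are absent from this dict.
    excluded_reads: rRNA and organelle reads removed from the denominator.
    All names are hypothetical and chosen for illustration.
    """
    denominator = total_library_reads - excluded_reads
    if denominator <= 0:
        raise ValueError("no reads left after exclusions")

    totals = Counter()
    for cluster, n_reads in mapped_counts.items():
        repeat_type = annotation.get(cluster)
        if repeat_type is not None:
            totals[repeat_type] += n_reads
    return {rtype: 100.0 * n / denominator for rtype, n in totals.items()}


# Toy example: CL3 is an rDNA cluster, so it is not annotated as a repeat
mapped = {"CL1": 300, "CL2": 200, "CL3": 1000, "CL5": 50}
annotation = {"CL1": "Ty3/gypsy", "CL2": "Ty3/gypsy", "CL5": "Ty1/copia"}
tp = transcript_proportions(mapped, 101000, annotation, excluded_reads=1000)
for repeat_type, value in tp.items():
    print(repeat_type, round(value, 3), "%")
```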
Mapping repeats and transcribed repeats to the Setaria viridis whole genome assembly
To compare the results obtained using the genome skimming data (see above, = RepeatExplorer-identified repeats) with those using a whole genome assembly as input to characterize the repeats (see below, = DANTE-identified LTR retroelements), we took advantage of the platinum grade chromosome-level whole genome assembly available for S. viridis ‘A10’ (GenBank: GCF_005286985.1) (Mamidi et al., 2020).
First, DNA sequence reads from the genome assembly were used as input to identify and annotate LTR retroelements (LTR-RT) using the DANTE v0.1.8 (https://doi.org/10.5281/zenodo.8183566) and DANTE_LTR v0.3.5 pipeline (https://doi.org/10.5281/zenodo.10213785) available on the RepeatExplorer Galaxy server (DOI: 10.1038/s41596-020-0400-y).
The sequences of the identified LTR-RT elements were used to create a custom library of LTR-RT elements using the "dante_ltr_to_library" script from the DANTE_LTR repository, and this library was used in a RepeatMasker search to annotate the LTR retroelements using a similarity-based approach. The RepeatMasker search was performed on the RepeatExplorer Galaxy server with options "-xsmall -no_is -e ncbi". The output enables both complete and incomplete LTR retroelements to be identified based on the presence or absence of the following three components: long terminal repeats (LTRs), primer binding site (PBS) and target site duplication (TSD). Complete elements had all three components, while incomplete elements lacked at least one of these. The GFF file following annotation served as a reference for counting reads in the BAM files using HTSeq-count. We filtered out repeats with at least 100 mapped reads and inspected each location in the genome assembly using the Integrative Genomics Viewer (IGV) to determine their proximity to genes or their location within genes, e.g. within introns. Additionally, we identified specific repeats and investigated those that were differentially expressed with DESeq2 (RPM normalization). In the heatmaps, transcript abundance was normalized by row.
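The classification rule and the read-count threshold described above can be written out as a short Python sketch. It does not parse DANTE_LTR or HTSeq-count output: the presence of each structural component is passed in as a boolean, the element identifiers are invented for the example, and RPM is computed as reads per million reads in the library, which is an assumption about how the normalization was defined.

```python
def classify_ltr_element(has_ltr, has_pbs, has_tsd):
    """'complete' if LTRs, PBS and TSD are all present, else 'incomplete'."""
    return "complete" if (has_ltr and has_pbs and has_tsd) else "incomplete"


def elements_with_min_reads(read_counts, min_reads=100):
    """Keep elements with at least min_reads mapped reads."""
    return {elem: n for elem, n in read_counts.items() if n >= min_reads}


def reads_per_million(read_counts, library_size):
    """Scale raw counts to reads per million reads in the library."""
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    return {elem: n * 1_000_000 / library_size
            for elem, n in read_counts.items()}


# Hypothetical element identifiers and counts, for illustration only
counts = {"LTR_0001": 250, "LTR_0002": 40, "LTR_0003": 100}
kept = elements_with_min_reads(counts)
print(kept)                                    # {'LTR_0001': 250, 'LTR_0003': 100}
print(reads_per_million(kept, 2_000_000))      # {'LTR_0001': 125.0, 'LTR_0003': 50.0}
print(classify_ltr_element(True, True, True))  # complete
print(classify_ltr_element(True, False, True)) # incomplete
```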
In addition, we mapped the repeat RNA-seq reads identified in the poly-A and ribo-depleted RNA libraries to the whole genome assembly using Bowtie2 (Dobin et al., 2013). The .bam files were converted to .bed and .gff3 files using bedtools (Quinlan & Hall, 2010) and genometools (Gremme et al., 2013).
Linear regression models
Multiple linear regression models were tested to analyse the relationships between the TP of each repeat and the following factors: GP, GC content, sequence similarity (%), tissue type, experimental condition and repeat type. In all models, we analysed three replicates. To meet statistical assumptions inherent to the models, some variables were log-transformed to generate a normal distribution of values. In these cases, clusters that had a zero TP value (y axis) were not included. All linear regressions and statistical analyses were performed using R version 4.0.2 (R Team, 2020). Associated figures were generated using RStudio and Photoshop CC version 2012.0.1 (Adobe Systems).
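To make the log transformation and the handling of zero TP values concrete, the sketch below fits a single-predictor regression of log10(TP) on log10(GP) in plain Python. It is a deliberate simplification: the study fitted multiple regression models in R, whereas this example uses one predictor, ordinary least squares and invented toy values.

```python
import math


def fit_log_log(gp_values, tp_values):
    """Least-squares fit of log10(TP) = intercept + slope * log10(GP).

    Clusters with a zero TP (or zero GP) cannot be log-transformed and
    are dropped. Returns (slope, intercept, n_points_used).
    """
    pairs = [(math.log10(gp), math.log10(tp))
             for gp, tp in zip(gp_values, tp_values)
             if gp > 0 and tp > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two non-zero data points")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("GP values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, n


# Toy data in which TP = 0.2 * GP; the third cluster has TP = 0 and is dropped
gp = [0.01, 0.1, 1.0, 10.0]
tp = [0.002, 0.02, 0.0, 2.0]
slope, intercept, n_used = fit_log_log(gp, tp)
print(round(slope, 3), round(intercept, 3), n_used)
```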