Dryad

Data from: Transposable element annotation in non-model species - on the benefits of species specific repeat libraries using semi-automated EDTA and DeepTE de novo pipelines

Cite this dataset

Bell, Ellen; Butler, Christopher; Taylor, Martin (2021). Data from: Transposable element annotation in non-model species - on the benefits of species specific repeat libraries using semi-automated EDTA and DeepTE de novo pipelines [Dataset]. Dryad. https://doi.org/10.5061/dryad.m0cfxpp3h

Abstract

Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing underestimates of TE abundance. Here, we describe the semi-automated generation of a de novo TE library which combines the newly described EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras sp. C115). We assess performance using both genomic and transcriptomic input by five metrics: (i) abundance, (ii) composition, (iii) fragmentation, (iv) age distributions and (v) capture of potential horizontally transferred TEs. We identified notable differences in these metrics between TE libraries, and highlight how library choice can have a major impact on TE content estimates in non-model species.

This repository contains six raw (unparsed) RepeatMasker (RM) output files for two genomes (Corydoras sp. C115 and Corydoras maculifer) and one transcriptome (C. maculifer), together with two repeat libraries (one based on the RepBase Danio rerio library and one de novo library built on the C. sp. C115 genome). For each genome and transcriptome, the RM output files correspond to one homology-based transposon search using the D. rerio library and one species-specific search using the de novo library. It also includes a script to accompany the horizontal transfer analysis and a transposable element renaming script.

Methods

A de novo TE library was generated for the C. sp. C115 genome using the Extensive de-novo TE Annotator (EDTA) (Ou et al., 2019) set to the ‘others’ species parameter. We used the inbuilt RepeatModeler (Smit & Hubley, 2008) support, which identifies any remaining TEs that might have been overlooked by the EDTA algorithm (--sensitive 1). Classifications within this library were refined with DeepTE (Yan et al., 2020) using the predefined metazoan model parameter setting (-m). TE identification was performed using RepeatMasker (RM; version 1.332) with the NCBI/RMBLAST (version 2.6.0+) search engine. This analysis was conducted against either the D. rerio Repbase (2018-10-26) entry, which was also run through DeepTE (to allow for uniformity in TE classification), or the Corydoras-specific library. RM was run under the most sensitive (-s) parameter setting in all instances. The genomic and transcriptomic RM output files were subsequently parsed with a custom R script which (i) removed non-distinct elements, retaining the higher-scoring match where its domain partly included the domain of another match, (ii) removed repetitive elements not classed as TEs (e.g. microsatellites, simple repeats and sRNAs), (iii) merged elements found on the same contig if they had the same name and orientation and their combined sequence length was less than or equal to that of the corresponding reference sequence in RepBase, and (iv) removed merged repeats shorter than 80 base pairs. Additionally, for transcriptomic data, if multiple identical repeats were found across different transcript isoforms, only one was retained, to ensure that each repeat represented a unique genomic locus. This script is publicly available from https://github.com/clbutler/RM_TRIPS.
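The filtering and merging steps (ii) to (iv) described above can be illustrated with a minimal Python sketch. This is not the RM_TRIPS script (which is written in R) and may differ from it in detail. The `Hit` record, the `filter_and_merge` function, the `ref_length` field and the `NON_TE_CLASSES` set are hypothetical names introduced here for illustration only, and hits are assumed to have already been read from an RM output file.

```python
from dataclasses import dataclass

# Repeat classes treated as non-TE in this sketch (an illustrative assumption,
# not the list used by the authors' script).
NON_TE_CLASSES = {"Simple_repeat", "Low_complexity", "Satellite",
                  "snRNA", "rRNA", "tRNA", "scRNA", "srpRNA"}
MIN_LENGTH = 80  # base-pair threshold from step (iv)


@dataclass
class Hit:
    contig: str
    start: int       # 1-based, inclusive
    end: int         # inclusive
    strand: str      # "+" or "C", as in RepeatMasker output
    name: str        # matching repeat name
    te_class: str    # e.g. "DNA/hAT"
    ref_length: int  # hypothetical: length of the library reference sequence
    bp: int = 0      # bases covered; derived from coordinates when left at 0

    def __post_init__(self):
        if self.bp == 0:
            self.bp = self.end - self.start + 1


def filter_and_merge(hits):
    # (ii) drop repeats whose top-level class is not a TE
    te_hits = [h for h in hits if h.te_class.split("/")[0] not in NON_TE_CLASSES]

    # (iii) merge hits sharing contig, name and orientation, provided the
    # combined length does not exceed the reference sequence length
    te_hits.sort(key=lambda h: (h.contig, h.name, h.strand, h.start))
    merged = []
    for h in te_hits:
        prev = merged[-1] if merged else None
        same_element = (
            prev is not None
            and (prev.contig, prev.name, prev.strand) == (h.contig, h.name, h.strand)
        )
        if same_element and prev.bp + h.bp <= h.ref_length:
            merged[-1] = Hit(prev.contig, prev.start, max(prev.end, h.end),
                             prev.strand, prev.name, prev.te_class,
                             prev.ref_length, prev.bp + h.bp)
        else:
            merged.append(h)

    # (iv) drop (merged) repeats shorter than the length threshold
    return [h for h in merged if h.bp >= MIN_LENGTH]
```

In this sketch the merged length is tracked as the sum of covered bases rather than the span between outer coordinates, which is one possible reading of "combined sequence length".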

Additional scripts describe the horizontal transfer of transposable elements analysis included in the accompanying manuscript.

Usage notes

Please note that the RepeatMasker output files are raw and unparsed. To parse the data as in the manuscript, please use the parsing script published here: https://github.com/clbutler/RM_TRIPS
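For readers who only want to inspect the raw files, the following Python sketch reads the whitespace-delimited rows of a RepeatMasker .out file into dictionaries. It is an illustration, not a substitute for the RM_TRIPS script; the function name `parse_rm_out` and the dictionary keys are hypothetical. It assumes the standard .out layout of 15 columns with an optional trailing asterisk, which RepeatMasker uses to flag a match that is partly covered by a higher-scoring match.

```python
def parse_rm_out(lines):
    """Parse RepeatMasker .out rows from an iterable of text lines."""
    records = []
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows have at least 15
        # columns and begin with an integer Smith-Waterman score.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        records.append({
            "score": int(fields[0]),
            "perc_div": float(fields[1]),
            "contig": fields[4],
            "start": int(fields[5]),
            "end": int(fields[6]),
            "strand": fields[8],    # "+" or "C" (complement)
            "name": fields[9],      # matching repeat
            "class": fields[10],    # repeat class/family
            # Trailing "*" marks a match overlapped by a higher-scoring one
            "overlapped": len(fields) > 15 and fields[15] == "*",
        })
    return records
```

A file from this repository could then be loaded with, for example, `with open("Unparsed_DeNovolib_CmaculiferSample56_transcriptome.out") as fh: records = parse_rm_out(fh)`.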

File List:

DanioLib_DeepTE_clean.fasta -> The RepBase Danio library which has been run through the DeepTE program for TE classification

Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta -> The de novo transposable element library we produced for the Corydoras sp. C115 genome using EDTA and then DeepTE

Horizontal_transfer_Analysis_script.R -> The R script used for the horizontal transfer of transposable elements analysis

Unparsed_DanioDeepTElib_CmaculiferGenome_AssemblyNameCM_19_scafSeq.fas.out -> Unparsed RepeatMasker output using DanioLib_DeepTE_clean.fasta as the repeat library and the Corydoras maculifer genome (available on GenBank)

Unparsed_DanioDeepTElib_CmaculiferSample56_transcriptome.out -> Unparsed RepeatMasker output using DanioLib_DeepTE_clean.fasta as the repeat library and the Corydoras maculifer transcriptome (available on GenBank)

Unparsed_DanioDeepTElib_CorydorasC115genome_AssemblyNameLin1PacBio.ctg.fa.r3p3_pilon_3.fasta.out -> Unparsed RepeatMasker output using DanioLib_DeepTE_clean.fasta as the repeat library and the Corydoras sp. C115 genome (available on GenBank)

Unparsed_DeNovolib_CmaculiferGenome_AssemblyNameCM_19_scafSeq.fas.out -> Unparsed RepeatMasker output using Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta as the repeat library and the Corydoras maculifer genome (available on GenBank)

Unparsed_DeNovolib_CmaculiferSample56_transcriptome.out -> Unparsed RepeatMasker output using Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta as the repeat library and the Corydoras maculifer transcriptome (available on GenBank)

Unparsed_DeNovolib_CorydorasC115genome_AssemblyNameLin1PacBio.ctg.fa.r3p3_pilon_3.fasta.out -> Unparsed RepeatMasker output using Lin1C115wtgdb_EDTADeepTE_cleanLib_V1.1.fasta as the repeat library and the Corydoras sp. C115 genome (available on GenBank)

Unparsed_DanioDeepTElib_DanioGenome_AccessionNoGCF_000002035.6_GRCz11.out -> Unparsed RepeatMasker output using DanioLib_DeepTE_clean.fasta as the repeat library against the Danio rerio genome (Accession number: GCF_000002035.6_GRCz11)

Funding

Biotechnology and Biological Sciences Research Council, Award: BB/R017174/1

NERC Environmental Bioinformatics Centre, Award: NE/L002582/1