Transposable element libraries from 101 fish
Data files
Sep 28, 2023 version files 84.80 MB
Abstract
Repetitive DNA makes up a considerable fraction of most eukaryotic genomes. In fish, transposable element (TE) activity has coincided with rapid species diversification. Here, we annotated the repetitive content in 100 genome assemblies, covering the major branches of the diverse lineage of teleost fish. We investigated whether TE content correlates with family-level net diversification rates and found support for a weak negative correlation. Further, we found that TE proportion correlates with genome size, but not with the proportion of short tandem repeats (STRs), which implies independent evolutionary paths. Marine and freshwater fish differ greatly in STR content. The most extreme STR propagation was found in the genomes of codfish species and Atlantic herring. Such a high density of STRs is likely to increase the mutational load, which we propose could be counterbalanced by high fecundity as seen in codfishes and herring.
README: Teleost TE libraries
This repository contains de novo libraries of transposable element (TE) consensus sequences (FASTA files), one per genome assembly. The results of masking each genome assembly with RepeatMasker using these de novo libraries can be found at https://doi.org/10.6084/m9.figshare.8280800.
Description of the data and file structure
Each file is one TE library (FASTA file) that can be used to mask a genome assembly. To generate the libraries, we used a variant of the computational pipeline that is more thoroughly described in (Tørresen et al. 2017), available at https://github.com/uio-cels/Repeats. The pipeline includes multiple TE detection steps using different tools, steps for removing non-TEs from the detected sequences and steps for classifying the elements. For the initial detection step, we used RepeatModeler (v. 1.0.8) (Smit & Hubley 2008-2015) and LTRharvest (part of GenomeTools v. 1.5.7) (Ellinghaus et al. 2008). RepeatModeler detects all sorts of repetitive sequences and LTRharvest is specialized for detecting LTR-RTs. Using BLASTX, TEs with sequences matching known non-TEs in UniProtKB/Swiss-Prot were removed. To classify the TEs, we used RepeatClassifier, which is a part of the RepeatModeler software. As the tool did not manage to classify all of the remaining sequences, additional similarity searches were performed between the sequences and a curated library of TE sequences (RepBase v. 20150807), using nucleotide BLAST. Finally, we built Hidden Markov Model profiles from the detected sequences using HMMER (v. 3.1b1) (Wheeler & Eddy 2013) and compared the profiles with HMM profiles from databases downloaded from GyDB.org (Llorens et al. 2011) and dfam.org (Hubley et al. 2016), using the nhmmer feature included in HMMER. This resulted in additional sequences being classified at the class and subclass level. The pipeline resulted in one de novo library per assembly, which contained the consensus sequences of the interspersed repeats detected in each assembly.
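As a quick way to inspect one of these libraries, the sketch below counts consensus sequences and bases per classification label. It assumes the FASTA headers follow the RepeatModeler/RepeatClassifier convention of `>name#Class/Family`; whether every library in this repository keeps that convention is an assumption, and the three records used here are invented for illustration only.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def classification(header):
    """Return the label after '#' in the sequence ID, or 'Unknown' if absent.

    Assumption: headers look like 'name#Class/Family' (RepeatModeler style).
    """
    parts = header.split()
    seq_id = parts[0] if parts else ""
    if "#" not in seq_id:
        return "Unknown"
    return seq_id.split("#", 1)[1] or "Unknown"


def summarize(handle):
    """Count consensus sequences and total bases per classification label."""
    counts, bases = Counter(), Counter()
    for header, seq in read_fasta(handle):
        label = classification(header)
        counts[label] += 1
        bases[label] += len(seq)
    return counts, bases


# Hypothetical three-record library, not taken from the data set.
example = StringIO(
    ">rnd-1_family-1#LTR/Gypsy\nACGTACGTAC\nGGTT\n"
    ">rnd-1_family-2#DNA/hAT-Ac\nTTTTAAAA\n"
    ">rnd-1_family-3\nACGT\n"
)
counts, bases = summarize(example)
for label in sorted(counts):
    print(label, counts[label], bases[label])
```

To run this on a real library, replace the `StringIO` example with `open(path)` on one of the FASTA files; sequences without a `#` label are grouped under `Unknown`.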
Sharing/Access information
The source genome assemblies used to generate the consensus libraries were retrieved from the following sources:
SOURCE SPECIES
Malmstrom et al. 2017 Acanthochaenus luetkenii
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000751415.1_Midas_v5/GCA_000751415.1_Midas_v5_genomic.fna.gz Amphilophus citrinellus
Malmstrom et al. 2017 Anabas testudineus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000695075.1_Anguilla_anguilla_v1_09_nov_10/GCA_000695075.1_Anguilla_anguilla_v1_09_nov_10_genomic.fna.gz Anguilla anguilla
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000470695.1_japanese_eel_genome_v1_25_oct_2011_japonica_c401b400k25m200_sspacepremiumk3a02n24_extra.final.scaffolds/GCA_000470695.1_japanese_eel_genome_v1_25_oct_2011_japonica_c401b400k25m200_sspacepremiumk3a02n24_extra.final.scaffolds_genomic.fna.gz Anguilla japonica
Musilova et al. 2018 Anoplogaster cornuta
Malmstrom et al. 2017 Antennarius striatus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000239415.1_AstBur1.0/GCF_000239415.1_AstBur1.0_genomic.fna.gz Astatotilapia burtoni
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000372685.1_Astyanax_mexicanus-1.0.2/GCF_000372685.1_Astyanax_mexicanus-1.0.2_genomic.fna.gz Astyanax mexicanus
Malmstrom et al. 2017 Bathygadus melanobranchus
Malmstrom et al. 2017 Benthosema glaciale
Malmstrom et al. 2017 Beryx splendens
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000788275.1_BP.fa/GCA_000788275.1_BP.fa_genomic.fna.gz Boleophthalmus pectinirostris
Malmstrom et al. 2017 Borostomias antarcticus
Malmstrom et al. 2017 Brosme brosme
Malmstrom et al. 2017 Brotula barbata
SRX360276, GenBank, see Musilova et al. 2018 Caranx ignobilis
SRX360285, GenBank, see Musilova et al. 2018 Caranx melampygus
Malmstrom et al. 2017 Carapus acus
Musilova et al. 2018 Cetomimus sp
Malmstrom et al. 2017 Chaenocephalus aceratus
Malmstrom et al. 2017 Chatrabus melanurus
Malmstrom et al. 2017 Chromis chromis
SRX203077, GenBank, see Musilova et al. 2018 Clupea harengus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000523025.1_Cse_v1.0/GCF_000523025.1_Cse_v1.0_genomic.fna.gz Cynoglossus semilaevis
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000776015.1_ASM77601v1/GCA_000776015.1_ASM77601v1_genomic.fna.gz Cyprinodon nevadensis
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000776015.1_ASM77601v1/GCA_000776015.1_ASM77601v1_genomic.fna.gz Cyprinodon variegatus
SRX317090, GenBank, see Musilova et al. 2018 Cyprinus carpio
Malmstrom et al. 2017 Cyttopsis rosea
ftp://ftp.ensembl.org/pub//release-78/fasta/danio_rerio/dna/Danio_rerio.Zv9.dna.toplevel.fa.gz Danio rerio
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000689215.1_seabass_V1.0/GCA_000689215.1_seabass_V1.0_genomic.fna.gz Dicentrarchus labrax
Musilova et al. 2018 Diretmoides pauciradiatus
Musilova et al. 2018 Diretmus argenteus
SRX554947, GenBank, see Musilova et al. 2018 Electrophorus electricus
ERX432347, GenBank, see Musilova et al. 2018 Epinephelus aeneus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000721915.2_ASM72191v2/GCF_000721915.2_ASM72191v2_genomic.fna.gz Esox lucius
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000826765.1_Fundulus_heteroclitus-3.0.2/GCF_000826765.1_Fundulus_heteroclitus-3.0.2_genomic.fna.gz Fundulus heteroclitus
Malmstrom et al. 2017 Gadus morhua
ftp://ftp.ensembl.org/pub//release-78/fasta/gasterosteus_aculeatus/dna/Gasterosteus_aculeatus.BROADS1.dna.toplevel.fa.gz Gasterosteus aculeatus
Musilova et al. 2018 Gephyroberyx darwinii
Malmstrom et al. 2017 Guentherus altivela
Malmstrom et al. 2017 Helostoma temminkii
Malmstrom et al. 2017 Holocentrus rufus
Musilova et al. 2018 Hoplostethus atlanticus
Malmstrom et al. 2017 Lampris guttatus
Malmstrom et al. 2017 Lamprogrammus exutus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000742935.1_ASM74293v1/GCF_000742935.1_ASM74293v1_genomic.fna.gz Larimichthys crocea
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000242695.1_LepOcu1/GCF_000242695.1_LepOcu1_genomic.fna.gz Lepisosteus oculatus
Malmstrom et al. 2017 Lesueurigobius cf. sanzoi
Musilova et al. 2018 (unpublished) Lophius vaillanti
Malmstrom et al. 2017 Macrourus berglax
Malmstrom et al. 2017 Merluccius polli
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000238955.2_M_zebra_UMD1/GCF_000238955.2_M_zebra_UMD1_genomic.fna.gz Metriaclima zebra
Malmstrom et al. 2017 Monocentris japonica
SRX218060, GenBank, see Musilova et al. 2018 Monopterus albus
Malmstrom et al. 2017 Mora moro
Malmstrom et al. 2017 Myoxocephalus scorpius
Malmstrom et al. 2017 Myripristis jacobus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000239395.1_NeoBri1.0/GCF_000239395.1_NeoBri1.0_genomic.fna.gz Neolamprologus brichardi
Malmstrom et al. 2017 Neoniphon sammara
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000735185.1_NC01/GCF_000735185.1_NC01_genomic.fna.gz Notothenia coriiceps
Musilova et al. 2018 Opsanus beta
ftp://ftp.ensembl.org/pub//release-78/fasta/oreochromis_niloticus/dna/Oreochromis_niloticus.Orenil1.0.dna.toplevel.fa.gz Oreochromis niloticus
ftp://ftp.ensembl.org/pub//release-78/fasta/oryzias_latipes/dna/Oryzias_latipes.MEDAKA1.dna.toplevel.fa.gz Oryzias latipes
Malmstrom et al. 2017 Osmerus eperlanus
Malmstrom et al. 2017 Parablennius parvicornis
Malmstrom et al. 2017 Parasudis fraserbrunneri
Malmstrom et al. 2017 Perca fluviatilis
Malmstrom et al. 2017 Percopsis transmontana
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000787095.1_PS.fa/GCA_000787095.1_PS.fa_genomic.fna.gz Periophthalmodon schlosseri
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000787105.1_PM.fa/GCA_000787105.1_PM.fa_genomic.fna.gz Periophthalmus magnuspinnatus
SRX423854, GenBank, see Musilova et al. 2018 Pimephales promelas
ftp://ftp.ensembl.org/pub//release-78/fasta/poecilia_formosa/dna/Poecilia_formosa.PoeFor_5.1.2.dna.toplevel.fa.gz Poecilia formosa
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000633615.1_Guppy_female_1.0_MT/GCF_000633615.1_Guppy_female_1.0_MT_genomic.fna.gz Poecilia reticulata
Malmstrom et al. 2017 Polymixia japonica
Malmstrom et al. 2017 Pseudochromis fuscus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000787555.1_Pyoko_1.0/GCA_000787555.1_Pyoko_1.0_genomic.fna.gz Pseudopleuronectes yokohamae
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000239375.1_PunNye1.0/GCF_000239375.1_PunNye1.0_genomic.fna.gz Pundamilia nyererei
Malmstrom et al. 2017 Regalecus glesne
Malmstrom et al. 2017 Rondeletia loricata
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000233375.1_ICSASG_v2/GCF_000233375.1_ICSASG_v2_genomic.fna.gz Salmo salar
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000787155.1_SH.fa/GCA_000787155.1_SH.fa_genomic.fna.gz Scartelaos histophorus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_001005745.2_aro_v2/GCA_001005745.2_aro_v2_genomic.fna.gz Scleropages formosus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000475235.1_Snig1.0/GCA_000475235.1_Snig1.0_genomic.fna.gz Sebastes nigrocinctus
Malmstrom et al. 2017 Sebastes norvegicus
Malmstrom et al. 2017 Selene dorsalis
Malmstrom et al. 2017 Spondyliosoma cantharus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000690725.1_Stegastes_partitus-1.0.2/GCF_000690725.1_Stegastes_partitus-1.0.2_genomic.fna.gz Stegastes partitus
Malmstrom et al. 2017 Stylephorus chordatus
Malmstrom et al. 2017 Symphodus melops
Musilova et al. 2018 (unpublished) Syngnathus typhle
Musilova et al. 2018 (unpublished) Synodus synodus
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000400755.1_version_1_of_Takifugu_flavidus_genome/GCA_000400755.1_version_1_of_Takifugu_flavidus_genome_genomic.fna.gz Takifugu flavidus
ftp://ftp.ensembl.org/pub//release-78/fasta/takifugu_rubripes/dna/Takifugu_rubripes.FUGU4.dna.toplevel.fa.gz Takifugu rubripes
ftp://ftp.ensembl.org/pub//release-78/fasta/tetraodon_nigroviridis/dna/Tetraodon_nigroviridis.TETRAODON8.dna.toplevel.fa.gz Tetraodon nigroviridis
Malmstrom et al. 2017 Thunnus albacares
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000418415.1_Thunnus_orientalis_ver_Ba_1.0/GCA_000418415.1_Thunnus_orientalis_ver_Ba_1.0_genomic.fna.gz Thunnus orientalis
Malmstrom et al. 2017 Trisopterus minutus
Malmstrom et al. 2017 Typhlichthys subterraneus
ftp://ftp.ensembl.org/pub//release-78/fasta/xiphophorus_maculatus/dna/Xiphophorus_maculatus.Xipmac4.4.2.dna.toplevel.fa.gz Xiphophorus maculatus
Malmstrom et al. 2017 Zeus faber
Malmstrom et al. 2017 Arctogadus glacialis
Malmstrom et al. 2017 Molva molva
Malmstrom et al. 2017 Lota lota
Malmstrom et al. 2017 Brosme brosme
Malmstrom et al. 2017 Merluccius merluccius
Malmstrom et al. 2017 Merluccius polli
Malmstrom et al. 2017 Melanonus zugmayeri
Malmstrom et al. 2017 Macrourus berglax
Malmstrom et al. 2017 Malacocephalus occidentalis
Malmstrom et al. 2017 Bathygadus melanobranchus
Malmstrom et al. 2017 Boreogadus saida
Malmstrom et al. 2017 Muraenolepis marmoratus
Malmstrom et al. 2017 Bregmaceros cantori
Malmstrom et al. 2017 Mora moro
Malmstrom et al. 2017 Trisopterus minutus
Malmstrom et al. 2017 Trachyrincus scabrus
Malmstrom et al. 2017 Pollachius virens
Malmstrom et al. 2017 Melanogrammus aeglefinus
Malmstrom et al. 2017 Merlangius merlangus
Malmstrom et al. 2017 Theragra chalcogramma
Malmstrom et al. 2017 Gadiculus argenteus
Malmstrom et al. 2017 Phycis phycis
Malmstrom et al. 2017 Phycis blennoides
Malmstrom et al. 2017 Gadus morhua
Methods
For TE annotation, we used a variant of the computational pipeline that is more thoroughly described in (Tørresen et al. 2017), available at https://github.com/uio-cels/Repeats. The pipeline includes multiple TE detection steps using different tools, steps for removing non-TEs from the detected sequences and steps for classifying the elements. For the initial detection step, we used RepeatModeler (v. 1.0.8) (Smit & Hubley 2008-2015) and LTRharvest (part of GenomeTools v. 1.5.7) (Ellinghaus et al. 2008). RepeatModeler detects all sorts of repetitive sequences and LTRharvest is specialized for detecting LTR-RTs. Using BLASTX, TEs with sequences matching known non-TEs in UniProtKB/Swiss-Prot were removed. To classify the TEs, we used RepeatClassifier, which is a part of the RepeatModeler software. As the tool did not manage to classify all of the remaining sequences, additional similarity searches were performed between the sequences and a curated library of TE sequences (RepBase v. 20150807), using nucleotide BLAST. Finally, we built Hidden Markov Model profiles from the detected sequences using HMMER (v. 3.1b1) (Wheeler & Eddy 2013) and compared the profiles with HMM profiles from databases downloaded from GyDB.org (Llorens et al. 2011) and dfam.org (Hubley et al. 2016), using the nhmmer feature included in HMMER. This resulted in additional sequences being classified at the class and subclass level. The pipeline resulted in one de novo library per assembly, which contained the consensus sequences of the interspersed repeats detected in each assembly.
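The non-TE removal step can be illustrated with a minimal sketch. It assumes the BLASTX search was written in BLAST tabular format (`-outfmt 6`, with the query ID in column 1 and the e-value in column 11) and that the hit table contains only matches to non-TE proteins. The e-value cutoff, the sequence IDs and the hit lines are hypothetical; the cutoff used in the actual pipeline is not stated here.

```python
def queries_with_hits(blast_lines, max_evalue=1e-10):
    """Collect query IDs with at least one hit at or below max_evalue.

    Expects BLAST tabular output (-outfmt 6): 12 tab-separated columns,
    query ID in column 1 and e-value in column 11.
    The default cutoff is an illustrative assumption.
    """
    flagged = set()
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        if float(fields[10]) <= max_evalue:
            flagged.add(fields[0])
    return flagged


def drop_flagged(records, flagged):
    """Keep (seq_id, sequence) records whose ID is not flagged."""
    return [(seq_id, seq) for seq_id, seq in records if seq_id not in flagged]


# Hypothetical candidate library and hit table, for illustration only.
library = [("cand_1", "ACGT"), ("cand_2", "GGCC"), ("cand_3", "TTAA")]
hits = [
    "cand_2\tsp|P00001|HYPO_FAKE\t85.0\t120\t18\t0\t1\t360\t5\t124\t1e-50\t190",
    "cand_3\tsp|P00002|HYPO_FAKE\t30.0\t40\t28\t0\t1\t120\t10\t49\t0.5\t25.0",
]
kept = drop_flagged(library, queries_with_hits(hits))
print([seq_id for seq_id, _ in kept])
```

In this toy input, `cand_2` has a strong hit and is dropped, while `cand_3` has only a weak hit above the cutoff and is retained together with `cand_1`.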
Usage notes
This repository contains one de novo library (FASTA file) per genome assembly. The results of masking each genome assembly with RepeatMasker using these de novo libraries can be found at https://doi.org/10.6084/m9.figshare.8280800.
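For readers who want to redo the masking themselves, the following sketch assembles a RepeatMasker command line that uses one of the libraries as a custom library. The options shown (`-lib`, `-pa`, `-dir`, `-xsmall`, `-gff`) are standard RepeatMasker options, but the settings used for the published masking results are not given here, so this particular combination is an assumption. The file names are hypothetical placeholders.

```python
import shlex


def repeatmasker_command(genome, library, out_dir, threads=4):
    """Build a RepeatMasker argument list using a custom TE library.

    -lib     custom library of consensus sequences (FASTA)
    -pa      number of parallel jobs
    -dir     output directory
    -xsmall  soft-mask repeats (lowercase) instead of replacing with N
    -gff     also write annotations as GFF
    """
    return [
        "RepeatMasker",
        "-lib", str(library),
        "-pa", str(threads),
        "-dir", str(out_dir),
        "-xsmall",
        "-gff",
        str(genome),
    ]


# Hypothetical file names; substitute the assembly and its matching library.
cmd = repeatmasker_command("genome.fa", "Gadus_morhua.fa", "masked_out", threads=8)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The function only builds the argument list. With RepeatMasker installed and on the PATH, it could be executed with `subprocess.run(cmd, check=True)`.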