Transposable elements (TEs), selfish DNA sequences that can move within the genome, comprise a large proportion of the genomes of many organisms. Although low-coverage whole genome sequencing can be used to survey TE composition, it is uneconomical for species with large genomes. Here, we use restriction site-associated DNA sequencing (RADseq) as an alternative method to survey TE composition. First, we demonstrate in silico that double digest restriction site-associated DNA sequencing (ddRADseq) markers contain the same TE compositions as whole genome assemblies across arthropods. Next, we show empirically, using eight Synalpheus snapping shrimp species with large genomes, that TE compositions from ddRADseq and low-coverage whole genome sequencing are comparable within and across species. Finally, we develop a new bioinformatic pipeline, TERAD, to extract TE compositions from RADseq data. Our study expands the utility of RADseq to study the repeatome, making comparative studies of genome structure for species with large genomes more tractable and affordable.
Dryad_TERAD_output_8spX4sample
Output from the TERAD pipeline for 32 samples of Synalpheus snapping shrimps (eight species, four samples per species). We compared TE composition between ddRADseq and low-coverage whole genome sequencing (LC-WGS) data in eight snapping shrimp species in the genus Synalpheus (Alpheidae). We used one sample per species for both ddRADseq and LC-WGS, and then added three additional ddRADseq samples per species (four in total). For an explanation of the columns, see https://github.com/solomonchak/TERAD.
Dryad_LCWGS_TEpropXFamily
Summary of the genome percentages of major TE groups estimated from LC-WGS, corresponding to the ddRADseq analysis. The data cover eight species, with one sample per species.
Dryad_Arthropod_TE_from_Assembly_vs_simulated_ddRAD
Genomic proportions of TEs from whole genome assemblies and the proportional counts of simulated ddRADseq markers that contained TEs, for 16 arthropod species from three major lineages (Hexapoda, Chelicerata, and Crustacea). We used all of the crustacean whole genome assemblies available at the time of analysis (February 2018), as well as other arthropod species for which both a whole genome assembly and genome size data were available. Using a custom script (see Supporting information: simddRAD.sh), we simulated ddRADseq markers from whole genome assemblies using five combinations of dual restriction enzymes (SbfI-EcoRI, SphI-EcoRI, EcoRI-MspI, SphI-MluCI, and NlaIII-MluCI), listed in order of increasing genome coverage, and a wide size selection criterion (300 ± 36 bp). These enzyme combinations were reported by Peterson et al. (2012) to generate ddRADseq markers across two orders of magnitude of genome coverage in most species. We used RepeatMasker (Smit, Hubley, & Green, 2015) and the arthropod repeat database in Repbase v 20181026 (Bao, Kojima, & Kohany, 2015) to identify TEs in both the simulated ddRADseq markers and the whole genome assemblies. For genome assemblies, we calculated the proportion of base pairs in the genome that were annotated as TEs. For ddRADseq markers, we calculated the proportion of markers that contained TEs, because ddRADseq samples only a small fraction of the genome and the proportion of TE base pairs is therefore unlikely to be accurate.
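The in silico double digest and the two TE metrics described above can be illustrated with a minimal Python sketch. This is not the authors' simddRAD.sh script, and it makes several simplifying assumptions: each enzyme is taken to cut at the first base of its recognition site (real enzymes cut at an offset within the site), only the forward strand of a single sequence is scanned, and TE annotations are supplied as plain (start, end) intervals rather than parsed from RepeatMasker output. The function names and the interval representation are hypothetical and chosen for illustration only.

```python
import re

# Recognition sequences of the six enzymes named in the description.
SITES = {
    "SbfI": "CCTGCAGG",
    "EcoRI": "GAATTC",
    "SphI": "GCATGC",
    "MspI": "CCGG",
    "MluCI": "AATT",
    "NlaIII": "CATG",
}


def cut_positions(seq, enzyme):
    """Start positions of every (possibly overlapping) recognition site.

    Simplification: the cut is placed at the first base of the site.
    """
    return [m.start() for m in re.finditer(f"(?={SITES[enzyme]})", seq)]


def simulate_ddrad(seq, enzyme_a, enzyme_b, target=300, tol=36):
    """Return (start, end) fragments flanked by one cut from each enzyme
    whose length falls within target +/- tol bp."""
    seq = seq.upper()
    cuts = sorted(
        [(p, enzyme_a) for p in cut_positions(seq, enzyme_a)]
        + [(p, enzyme_b) for p in cut_positions(seq, enzyme_b)]
    )
    fragments = []
    # Adjacent cuts define the fragments produced by a complete digest.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        length = end - start
        if left != right and target - tol <= length <= target + tol:
            fragments.append((start, end))
    return fragments


def te_base_fraction(te_intervals, genome_length):
    """Assembly-style metric: fraction of bases covered by TE intervals.

    Overlapping intervals are counted once.
    """
    covered, last_end = 0, 0
    for start, end in sorted(te_intervals):
        start = max(start, last_end)
        if end > start:
            covered += end - start
            last_end = end
    return covered / genome_length


def te_marker_fraction(markers, te_intervals):
    """Marker-style metric: fraction of markers overlapping any TE interval."""
    if not markers:
        return 0.0
    hits = sum(
        any(s < m_end and e > m_start for s, e in te_intervals)
        for m_start, m_end in markers
    )
    return hits / len(markers)
```

For example, `simulate_ddrad(assembly_seq, "EcoRI", "MspI")` would return the retained fragments for one sequence, and passing those fragments to `te_marker_fraction` together with TE coordinates gives the marker-level proportion, while `te_base_fraction` gives the base-pair proportion used for assemblies. A production pipeline would also need to handle multi-record FASTA input, both strands, and true cut offsets.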