Transposable elements (TEs) are repetitive DNA sequences that can make new copies of themselves that are inserted elsewhere in a host genome. The abundance and distributions of TEs vary considerably among phylogenetically diverse hosts. With the aim of exploring the basis of this variation, we evaluated correlations between several genomic variables and the presence of TEs and non-TE repeats in the complete genome sequence of the Western clawed frog (Silurana tropicalis). This analysis reveals patterns of TE insertion consistent with gene disruption but not with the insertional preference model. Analysis of non-TE repeats recovered unique features of their genome-wide distribution when compared with TE repeats, including no strong correlation with exons and a particularly strong negative correlation with GC content. We also collected polymorphism data from 25 TE insertion sites in 19 wild-caught S. tropicalis individuals. DNA transposon insertions were fixed at eight of nine sites and at a high frequency at one of nine, whereas insertions of long terminal repeat (LTR) and non-LTR retrotransposons were fixed at only 4 of 16 sites and at low frequency at 12 of 16. A maximum likelihood model failed to attribute these differences in insertion frequencies to variation in selection pressure on different classes of TE, opening the possibility that other phenomena such as variation in rates of replication or duration of residence in the genome could play a role. Taken together, these results identify factors that sculpt heterogeneity in TE distribution in S. tropicalis and illustrate that genomic dynamics differ markedly among TE classes and between TE and non-TE repeats.
Logistic regression input and related R script, description of TE families
Supplementary Material: Description of Supplementary Files. We include 22 files as supplementary material, including all TE and non-TE repeat fragments in the Silurana tropicalis genome, as reported by RepeatMasker; input files to run logistic regression in R for the model where TEs are not included in the GC calculations; input files to run logistic regression in R for the model where TEs are included in the GC calculations; the R script to run both logistic regression models; descriptions of TE and non-TE repeat classes; and a description of TE families.
(a) 2012RepeatMasked: this is the standard output file from RepeatMasker. Columns include: SW score, percent div. (percent diverged), percent del. (percent deletion), percent ins., begin position in query, end position in query, TE family of matching repeat, repeat class/family, begin and end position in repeat, and RepeatMasker-assigned ID number.
(b) The input files contain information on the various genomic features and on the presence of TE or non-TE repeats in 2-kilobase windows. All windows are ordered in the same manner across the different input files for the same model (GC content calculated either including or excluding TE and non-TE repeats). Because the number of windows differs between the 2 models (windows that are completely full of TEs have to be excluded from the model in which GC content is calculated without including TEs), 2 sets of input files are used. The files for the model in which GC content is calculated without including TEs are:
a. TENoTENotContam.csv. This file denotes the presence or absence of TE and non-TE repeats with a 1 or 0. The columns correspond to the different TE or non-TE repeat classes, and the rows give the presence or absence of these repeats in each 2-kilobase window.
b. ConservedNoTE.txt, exonsNoTE.txt, intronsNoTE.txt and GCNoTE.txt. The fourth column of these files lists the proportion of the window that is conserved across species, the proportion that is exon, the proportion that is intron, and the percent GC content, respectively. The first 3 columns are “linkage group or not” (whether we concatenated different chromosomes into a linkage group, “LG”, or not, “nonLG”), “linkage group number or chromosome number”, and “genomic window number”.
c. distancesNoTE.txt lists the distance, in base pairs, to the closest gene upstream and downstream in columns 4 and 5, respectively. expressionNoTE.txt lists the proportion of the window that is expressed in germline genes or soma genes in columns 4 and 5, respectively. The first 3 columns are again “linkage group or not”, “linkage group number or chromosome number”, and “genomic window number”.
(c) This pattern of input files is repeated for the model where percent GC content in a window is calculated including TE and non-TE repeats. The files are similarly named TENotContam.csv, Conserved.txt, exons.txt, introns.txt, GC.txt, distances.txt and expression.txt.
(d) For long TEs and short TEs, separate files are provided for the presence or absence of TEs in genomic windows, where GC content is calculated either including or excluding TEs and non-TE repeats. These files are titled “TELong.csv”, “TELongNoTE.csv”, “TEshort.csv” and “TEShortNoTE.csv”.
(e) modelTE.R is the script used to read the input files and run the logistic regressions.
(f) Summary of classes.xlsx summarizes the total length and number of fragments of the different classes of TEs and non-TE repeats.
(g) Summary of Families.xlsx summarizes the total TE or non-TE repeat fragments in each TE family or non-TE repeat.
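Because the input files share a window order, a presence/absence file and the feature files can be combined row by row. The Python sketch below illustrates this under several assumptions that the description does not confirm: the CSV has a header row of class names, the .txt files are whitespace-delimited with no header, and the class names (DNA, LTR) and all values are invented for illustration. It is not the authors' modelTE.R script, which should be consulted for the actual procedure.

```python
import csv
import io


def read_presence(handle):
    """Read a presence/absence CSV: header of class names, then 0/1 rows (assumed layout)."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = [[int(v) for v in row] for row in reader if row]
    return header, rows


def read_feature(handle, column=3):
    """Read one 0-based column (default: the fourth) from a whitespace-delimited file."""
    values = []
    for line in handle:
        fields = line.split()
        if fields:
            values.append(float(fields[column]))
    return values


def join_windows(header, presence_rows, features):
    """Combine files row by row, relying on the shared window order."""
    for name, values in features.items():
        if len(values) != len(presence_rows):
            raise ValueError(
                f"{name}: {len(values)} windows, expected {len(presence_rows)}"
            )
    table = []
    for i, row in enumerate(presence_rows):
        record = dict(zip(header, row))
        for name, values in features.items():
            record[name] = values[i]
        table.append(record)
    return table


# Invented stand-ins for TENoTENotContam.csv, GCNoTE.txt and exonsNoTE.txt.
presence_csv = io.StringIO("DNA,LTR\n1,0\n0,0\n1,1\n")
gc_txt = io.StringIO("LG 1 1 41.0\nLG 1 2 38.5\nnonLG 7 1 44.0\n")
exon_txt = io.StringIO("LG 1 1 0.10\nLG 1 2 0.00\nnonLG 7 1 0.25\n")

header, presence_rows = read_presence(presence_csv)
table = join_windows(
    header,
    presence_rows,
    {"gc": read_feature(gc_txt), "exon": read_feature(exon_txt)},
)
print(table[0])
```

The length check matters because the two models use different numbers of windows, so files from the two sets should not be mixed; each resulting record holds one window's repeat presence flags alongside its genomic features.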
suppMat.zip