DNA gains and losses in gigantic genomes do not track differences in transposable element-host silencing interactions
Data files
Abstract
Size evolution among gigantic genomes involves gain and loss of many gigabases of transposable elements (TEs), sequences that parasitize host genomes. Vertebrates suppress TEs using piRNA and KRAB-ZFP pathways. TEs and hosts coevolve in an arms race, where suppression strength reflects TE fitness costs. In enormous genomes, additional TE costs become minuscule. How, then, do TEs and host suppression invoke further addition of massive DNA amounts? We analyze TE proliferation histories, deletion rates, and community diversities in six salamander genomes (21.3 - 49.9 Gb), alongside gonadal expression of TEs and suppression pathways. We show that TE activity is higher in testes than ovaries, attributable to lower KRAB-ZFP suppression. Unexpectedly, genome size and expansion are uncorrelated with TE deletion rate, proliferation history, expression, and host suppression. Also, TE community diversity increases with genome size, contrasting theoretical predictions. We infer that TE-host antagonism in gigantic genomes produces stochastic TE accumulation, reflecting noisy intermolecular interactions in huge genomes and cells.
Dataset DOI: 10.5061/dryad.zpc866tkv
Our manuscript describes differences in TE composition, TE diversity, TE proliferation history (based on histograms of sequence divergence), and ectopic recombination-mediated TE deletion among the genomes of six salamander species with different genome sizes. In addition, we quantified expression of TEs and of the piRNAs that target them, as well as of genes in TE silencing pathways.
Description of the data and file structure
The data included in this Dryad submission are those we used to generate the five figures in our manuscript. For each figure, we include a folder containing one or more Excel spreadsheets with the relevant data, as well as the R code for producing the figure itself. When relevant, we also include tree files.
Sharing/Access information
Genomic shotgun and transcriptome sequences have been deposited in the Genome Sequence Archive at the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation (GSA: CRA008892, CRA008899, CRA008900, CRA009444, CRA009736, CRA009747, CRA009748, CRA009749), and are publicly accessible at http://bigd.big.ac.cn/gsa.
Files and variables
File Folders: Fig_1, Fig_2, Fig_3, Fig_4, Fig_5
Description: Each figure is associated with its own folder, which includes all files with the underlying data as well as R code for generating the figure.
Figure 1 includes a spreadsheet with the percentage of sequencing reads mapped in bins of different percent sequence divergence for each TE superfamily in each salamander species. In addition, Figure 1 includes a spreadsheet that contains the percentage of the genome made up of different TE classes in each species, as well as the percentage of the genome made up of each TE superfamily. There is also a Newick file that contains the phylogeny and branch lengths depicted on the left side of the figure. Finally, there is an R script that will generate all of the graphs in Figure 1; the salamander image files used in the figure are also in the folder.
Figure 2 includes a spreadsheet with TE community diversity calculated using both Shannon's (SI) and Gini-Simpson (GSI) indices, as well as genome size (GS) and membership in major vertebrate clades/taxonomic groups. There is an R script that will generate the graphs included in Figure 2.
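For readers unfamiliar with the two indices, the following is a minimal Python sketch of how Shannon's and Gini-Simpson diversity are conventionally computed from a vector of abundances (for example, the genome percentage of each TE superfamily). It is not the authors' R code. The choice of abundance unit and the use of the natural logarithm are assumptions, and the function names are hypothetical.

```python
import math

def shannon_index(abundances):
    """Shannon's index H = -sum(p_i * ln p_i); natural log is an assumption."""
    total = sum(abundances)
    # Zero-abundance categories contribute nothing and are skipped.
    props = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in props)

def gini_simpson_index(abundances):
    """Gini-Simpson index 1 - sum(p_i^2)."""
    total = sum(abundances)
    return 1.0 - sum((a / total) ** 2 for a in abundances)

# Hypothetical example: four equally abundant TE superfamilies.
even_community = [25.0, 25.0, 25.0, 25.0]
print(shannon_index(even_community))       # ln(4), about 1.386
print(gini_simpson_index(even_community))  # 0.75
```

Both indices increase as abundance is spread more evenly across more superfamilies; a community dominated by a single superfamily scores near zero on both.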
Figure 3 includes a spreadsheet that contains information on the ratio of terminal to internal sequences for between 12 and 25 individual TE contigs for each of 5 salamander species. There is also an R script that will generate the graph in Figure 3.
Figure 4 includes two spreadsheets. The first workbook, entitled "genome_transcriptTE" (Data_genome_transcript_TE.xlsx in the file list), includes species name, TE superfamily, and, for each individual sampled, whether it is male or female, the percentage of the genome consisting of the TE superfamily, and the average expression level of the TE superfamily in TPM (transcripts per million). The second, called "Data TE exp piRNA mapping" (Data_TE_exp_piRNA_mapping.xlsx in the file list), includes species name, TE superfamily, male or female for each individual, the piRNA expression level (piRNA map TE RPM, or reads per million), and the TE expression level (in TPM). There is also an R script that will generate the graph in Figure 4.
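The two expression units above follow standard definitions, sketched here in Python for orientation. This is a generic illustration, not the authors' pipeline. The function names and the example numbers are hypothetical, and the denominator used for RPM (total mapped reads in the library) is an assumption.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million: length-normalize counts, then scale to sum to 1e6."""
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]

def rpm(reads_mapped_to_feature, total_mapped_reads):
    """Reads per million: fraction of the library mapping to a feature, times 1e6."""
    return reads_mapped_to_feature / total_mapped_reads * 1e6

# Hypothetical example: two transcripts with equal per-base coverage.
print(tpm([10, 20], [1000, 2000]))  # each gets half of the million
print(rpm(50, 1_000_000))           # 50 reads in a library of one million reads
```

TPM values within one sample always sum to one million, so they describe relative rather than absolute expression.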
Figure 5 includes a spreadsheet summarizing expression of genes involved in different TE silencing pathways. The columns specify the TE silencing pathway and the rows specify total expression levels of the genes in that pathway normalized by expression of genes in the miRNA pathway. There is also an R script that will generate the graphs in Figure 5.
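The normalization described for Figure 5 can be illustrated with a short Python sketch: each pathway's total expression is divided by the total expression of the miRNA pathway genes from the same sample. The pathway labels and values below are hypothetical placeholders, not data from the spreadsheet, and the function is an illustration rather than the authors' R code.

```python
def normalize_by_reference(pathway_totals, reference="miRNA"):
    """Divide each pathway's total expression by the reference pathway's total."""
    reference_total = pathway_totals[reference]
    if reference_total <= 0:
        raise ValueError("reference pathway expression must be positive")
    return {name: total / reference_total for name, total in pathway_totals.items()}

# Hypothetical totals for one sample (arbitrary expression units).
sample_totals = {"piRNA": 30.0, "KRAB-ZFP": 15.0, "miRNA": 10.0}
print(normalize_by_reference(sample_totals))
```

After normalization the reference pathway equals 1 by construction, so the remaining values read as expression relative to the miRNA pathway.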
List of files in the “Figures_data_and_code” folder:
Figures_data_and_code
├── Fig_1
│ ├── Data_divergence dynamics.xlsx
│ ├── Data_phylogeny_tree_species_name.txt
│ ├── Data_TE_superfamily_coverage.xlsx
│ ├── Data_TE_type_percentage.xlsx
│ ├── Fig1_April_16ps_final.pdf
│ ├── Fig1_code.R
│ ├── Fig1_R_output_April_16.pdf
│ ├── Photo_Andrias.jpg
│ ├── Photo_Cynops.jpg
│ ├── Photo_Pachytriton.jpg
│ ├── Photo_Paramesotriton.jpg
│ ├── Photo_Ranodon.jpg
│ └── Photo_Tylototriton.jpg
│
├── Fig_2
│ ├── Data_TE_diversity.xlsx
│ ├── Fig2_April_17ps_final.pdf
│ ├── Fig2_code.R
│ └── Fig2_R_output_April_17.pdf
│
├── Fig_3
│ ├── Data_depth_TE.xlsx
│ ├── Fig3_April_17ps_final.pdf
│ ├── Fig3_code.R
│ └── Fig3_R_output_April_17.pdf
│
├── Fig_4
│ ├── Data_genome_transcript_TE.xlsx
│ ├── Data_TE_exp_piRNA_mapping.xlsx
│ ├── Fig4_April_17ps_final.pdf
│ ├── Fig4_code.R
│ └── Fig4_R_output_April_16.pdf
│
└── Fig_5
  ├── Data_pathway_gene.xlsx
  ├── Fig5_April_17ps_final.pdf
  ├── Fig5_code.R
  └── Fig5_R_output_April_16.pdf
File descriptions:
All figure files in the .pdf format were generated in R using the corresponding *_code.R scripts.
All files named “Data_*.xlsx” or “Data_*.txt” serve as the input files for R operations.
Code/software
None
