Dryad

Combined analysis of transposable elements and structural variation in maize genomes reveals genome contraction outpaces expansion

Cite this dataset

Munasinghe, Manisha et al. (2023). Combined analysis of transposable elements and structural variation in maize genomes reveals genome contraction outpaces expansion [Dataset]. Dryad. https://doi.org/10.5061/dryad.5qfttdz9t

Abstract

Background – Structural differences between genomes are a major source of genetic variation that contributes to phenotypic differences. Transposable elements, mobile genetic sequences capable of increasing their copy number and propagating themselves within genomes, can generate structural variation. However, their repetitive nature makes it difficult to characterize fine-scale differences in their presence at specific positions, limiting our understanding of their impact on genome variation. Domesticated maize is a particularly good system for exploring the impact of transposable element proliferation as over 70% of the genome is annotated as transposable elements. High-quality transposable element annotations were recently generated for de-novo genome assemblies of 26 diverse inbred maize lines.

Results – We generated base-pair resolved pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. From this data, we classified transposable elements as either shared or polymorphic in a given pairwise comparison. Our analysis uncovered substantial structural variation between lines, representing both putative insertion and deletion events. Putative insertions in SNP-depleted regions, which represent recently diverged identity by state blocks, suggest some TE families may still be active. However, our analysis reveals that genome-wide, deletions of transposable elements account for more structural variation than insertions. These deletions are often large structural variants containing multiple transposable elements.

Conclusions – Combined, our results highlight how transposable elements contribute to structural variation and demonstrate that deletion events are a major contributor to genomic differences.

README: Combined analysis of transposable elements and structural variation in maize genomes reveals genome contraction outpaces expansion

Structural differences between genomes are a major source of genetic variation. Transposable elements, or TEs, are mobile genetic sequences capable of increasing their copy number and propagating themselves within genomes. TEs can generate structural variation as a result of their transposition. Advancements in long-read sequencing technologies and computational methods have enhanced our ability to characterize and investigate structural variation between genomes. Detailed characterization of multiple haplotypes for several loci in domestic maize (Zea mays ssp. mays) revealed extensive structural polymorphisms for TE content. Given the high TE content of the maize genome, it is likely that TEs are a major contributor to structural variants (SVs), but this has yet to be fully quantified.

Recently, high-quality TE annotations were generated for de-novo genome assemblies of 26 diverse inbred maize lines used to generate the nested association mapping (NAM) population. Using these genome assemblies, we generated base-pair resolved pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. From this data, we classified features (either TEs or genes) as either shared or polymorphic in a given pairwise comparison. This data repository contains several relevant datasets both generated and used in our analysis. We are sharing this data here publicly for use by any interested parties. Scripts used in this analysis can be found separately at https://github.com/mam737/PolymorphicTEs_NAM.

Description of the data and file structure

Data included:

(1) Pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. This data was generated using AnchorWave (v1.0.1). The MAFToGVCF plugin of tassel v5.2.82 was used to reformat genome alignments in MAF format outputted by AnchorWave into variant calling records in GVCF format. Filenames for these datasets have the following structure - NAM_Line_Comparison.file_format.gz. So, B97.maf.gz represents the MAF output for the pairwise alignment between the B73 reference genome and the B97 inbred maize line. There are 25 maf.gz and 25 gvcf.gz files. Introductions to MAF file formats can be found here - https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/, and introductions to GVCF file formats can be found here - https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format.
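The filename convention above can be handled programmatically. The following is a minimal Python sketch, not part of the original analysis scripts: the helper names parse_alignment_filename and open_alignment are hypothetical, and the sketch assumes that NAM line names contain no "." characters.

```python
import gzip
from pathlib import Path

# File formats named in the README for the alignment files.
KNOWN_FORMATS = {"maf", "gvcf"}


def parse_alignment_filename(name):
    """Return (nam_line, file_format) for names like 'B97.maf.gz'.

    Hypothetical helper: assumes the NAM line name itself contains no dots.
    """
    parts = Path(name).name.split(".")
    if len(parts) != 3 or parts[1] not in KNOWN_FORMATS or parts[2] != "gz":
        raise ValueError(f"unexpected alignment filename: {name!r}")
    return parts[0], parts[1]


def open_alignment(path):
    """Open a gzipped MAF or GVCF file as text so it can be streamed line by line."""
    return gzip.open(path, "rt")
```

Streaming the files through gzip.open avoids decompressing whole-genome alignments to disk before inspecting them.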

(2) Summarised AnchorWave Regions - GVCFs generated by AnchorWave were subsequently parsed to break the alignment down into 'alignable', 'structural variant', or 'unalignable' sequences. Given our interest in shared or polymorphic TEs, we condensed the GVCF output by combining the nonvariant sites, single nucleotide variants, and small (<50bp) insertion/deletion events into a single class of 'alignable' regions. 'Structural variants' in one genotype relative to another were defined as regions >50bp for which the size in the other genotype is 0bp. The remaining variants all include at least one base pair in each genotype that is not fully aligned, and these regions were consequently classified as 'unalignable'. 'Unalignable' regions also include gaps in the AnchorWave alignment. Within this zipped file, there are 25 folders referring to each of the pairwise AnchorWave alignments. In each folder, there are 10 files, one for each chromosome of the alignment. Column structure is as follows

  • Column 1 - B73 Chromosome
  • Column 2 - B73 Start Position
  • Column 3 - B73 End Position
  • Column 4 - NAM Chromosome
  • Column 5 - NAM Start Position
  • Column 6 - NAM End Position
  • Column 7 - Region Type (either 'alignable_region', 'structural_insertion_inB73', 'structural_insertion_inNAM', or 'unalignable')
  • Column 8 - Unique Region Identifier for this AnchorWave Alignment
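As an illustration of the column structure above, the sketch below tallies how many regions of each type appear in one per-chromosome file. It is a minimal Python example under stated assumptions: the files are assumed to be tab-delimited with no header row (neither is confirmed here), and count_region_types is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Column 7 in the list above (Region Type), expressed as a zero-based index.
REGION_TYPE_INDEX = 6


def count_region_types(handle):
    """Count rows per region type in one summarised AnchorWave region file.

    Assumes tab-delimited text without a header row; rows with fewer than
    the eight documented columns are skipped.
    """
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 8:
            continue
        counts[row[REGION_TYPE_INDEX]] += 1
    return counts


# Made-up rows for illustration only; these are not real coordinates.
example = io.StringIO(
    "chr1\t1\t100\tchr1\t1\t100\talignable_region\tR1\n"
    "chr1\t100\t400\tchr1\t100\t100\tstructural_insertion_inB73\tR2\n"
    "chr1\t400\t500\tchr1\t100\t200\talignable_region\tR3\n"
)
example_counts = count_region_types(example)
```

If the files do carry a header row, the header text would show up as its own key in the result, which is an easy way to detect that the no-header assumption is wrong.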

(3) Polymorphic TE Calls. By intersecting publicly available TE annotations generated using EDTA with our pairwise alignments, we could classify any TE annotation as either shared or polymorphic in the given comparison. The pairwise nature of these alignments means that the B73 TE annotations would be classified 25 times relative to each NAM line, while each NAM line TE annotation would be classified once relative to B73. polymorphic_TE_calls.tar.gz contains all TE annotation classifications. Within this folder, there are two subfolders - B73 or NAM. B73 houses the 25 B73 TE annotations, while NAM houses the 25 NAM TE annotations. Within each folder, the file name structure is as follows B73_NAM_TE_classification_by_n.tsv where NAM indicates which inbred line is being compared. So, the file ~/polymorphic_TE_calls/B73/B73_B97_TE_classification_by_n.tsv classifies every B73 TE as either shared, polymorphic, or ambiguous (could not be classified as either shared or polymorphic) relative to the B73 versus B97 genome pairwise alignment. Column structure is the same regardless of whether you are looking at the B73 classifications or the compared NAM classifications.

  • Column 1 - TE_name - unique identifier for that TE annotation
  • Column 2 - chr - chromosomal location of that TE annotation
  • Column 3 - start - start base pair position of that TE annotation
  • Column 4 - end - end base pair position of that TE annotation
  • Column 5 - method - indicates whether annotation was identified using a structural approach (structural) or homology-based approach (homology)
  • Column 6 - type - denotes the identified type of TE annotation
  • Column 7 - class - denotes whether the TE annotation is a Class I or Class II TE
  • Column 8 - raw_superfamily - denotes the superfamily assigned by the EDTA TE annotation software
  • Column 9 - upd_superfamily - raw_superfamily names were transformed into shortened codes to provide updated superfamily names
  • Column 10 - condense_superfamily - further condensed superfamily names
  • Column 11 - raw_family - family name associated with TE annotation
  • Column 12 - alignable_region - proportion of TE annotation that overlaps alignable regions in pairwise alignment
  • Column 13 - structural_insertion_inB73 - proportion of TE annotation that overlaps structural variants with sequence in B73 (this column is entirely 0 for NAM TE annotations)
  • Column 14 - structural_insertion_inNAM - proportion of TE annotation that overlaps structural variants with sequence in NAM (this column is entirely 0 for B73 TE annotations)
  • Column 15 - unalignable - proportion of TE annotation that overlaps unalignable sequence identified from pairwise alignment
  • Column 16 - Missing_Data - proportion of TE annotation that overlaps gaps in the pairwise alignment
  • Column 17 - AW_Blocks - Unique identifier of AnchorWave blocks that TE annotation overlaps
  • Column 18 - classification - final classification group assigned (shared, polymorphic, or ambiguous)
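A common first step with these files is to tally classifications by superfamily. The sketch below is a minimal Python example rather than the authors' code: it assumes each .tsv file is tab-delimited and begins with a header row whose names match the column names listed above, and tally_classifications is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def tally_classifications(handle, group_column="condense_superfamily"):
    """Count TE annotations per (group, classification) pair.

    Assumes a tab-delimited file with a header row using the column names
    from the README (for example 'condense_superfamily', 'classification').
    """
    tally = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        tally[(row[group_column], row["classification"])] += 1
    return tally


# Made-up rows for illustration; superfamily codes and TE names are hypothetical.
example = io.StringIO(
    "TE_name\tcondense_superfamily\tclassification\n"
    "TE_0001\tLTR\tshared\n"
    "TE_0002\tLTR\tpolymorphic\n"
    "TE_0003\tTIR\tshared\n"
    "TE_0004\tLTR\tshared\n"
)
example_tally = tally_classifications(example)
```

Looking columns up by name rather than by position keeps the sketch independent of the exact column order in the files.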

(4) Polymorphic Gene Calls. By intersecting publicly available gene annotations with our pairwise alignments, we could classify any gene annotation as either shared or polymorphic in the given comparison. The pairwise nature of these alignments means that the B73 gene annotations would be classified 25 times relative to each NAM line, while each NAM line gene annotation would be classified once relative to B73. polymorphic_gene_calls.tar.gz contains all gene annotation classifications. Within this folder, there are two subfolders - B73 or NAM. Genes could be classified using either an exon-only or full-length approach. B73 houses the 25 B73 gene annotations, while NAM houses the 25 NAM gene annotations. Separate folders are used to house either the by_exon or by_full length analysis. Within each folder, the file name structure is as follows B73_NAM_gene_classification_by_type.tsv where NAM indicates which inbred line is being compared and type indicates whether it is an exon-only or full-length call. So, the file ~/polymorphic_gene_calls/B73/exon_calls/B73_B97_gene_classification_by_exon.tsv classifies every B73 gene as either shared, polymorphic, or ambiguous (could not be classified as either shared or polymorphic) relative to the B73 versus B97 genome pairwise alignment taking an exon-only gene approach. Column structure is the same regardless of whether you are looking at the B73 classifications or the compared NAM classifications.

  • Column 1 - id_name - unique identifier for that gene annotation
  • Column 2 - chr - chromosomal location of that gene annotation
  • Column 3 - start - start base pair position of that gene annotation
  • Column 4 - end - end base pair position of that gene annotation
  • Column 5 - exon_num - number of exons for that gene (This column is ONLY present for exon-only approach)
  • Column 6 - alignable_region - proportion of gene annotation that overlaps alignable regions in pairwise alignment
  • Column 7 - structural_insertion_inB73 - proportion of gene annotation that overlaps structural variants with sequence in B73 (this column is entirely 0 for NAM gene annotations)
  • Column 8 - structural_insertion_inNAM - proportion of gene annotation that overlaps structural variants with sequence in NAM (this column is entirely 0 for B73 gene annotations)
  • Column 9 - unalignable - proportion of gene annotation that overlaps unalignable sequence identified from pairwise alignment
  • Column 10 - Missing_Data - proportion of gene annotation that overlaps gaps in the pairwise alignment
  • Column 11 - AW_Blocks - Unique identifier of AnchorWave blocks that gene annotation overlaps
  • Column 12 - classification - final classification group assigned (shared, polymorphic, or ambiguous)
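Because the exon_num column is only present in the exon-only files, column positions differ between the two gene call types. The following minimal Python sketch reads either type by column name instead of position. It assumes tab-delimited files with a header row matching the names above; read_gene_calls is a hypothetical helper name.

```python
import csv
import io


def read_gene_calls(handle):
    """Read gene classification rows from an exon-only or full-length file.

    Assumes a tab-delimited file with a header row. 'exon_num' is returned
    as None when the column is absent (full-length calls) or empty.
    """
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        exon_num = row.get("exon_num")
        records.append(
            {
                "id_name": row["id_name"],
                "exon_num": int(exon_num) if exon_num not in (None, "") else None,
                "classification": row["classification"],
            }
        )
    return records


# Made-up rows for illustration; gene identifiers are hypothetical.
exon_example = io.StringIO(
    "id_name\texon_num\tclassification\n"
    "gene_A\t3\tshared\n"
)
full_example = io.StringIO(
    "id_name\tclassification\n"
    "gene_A\tpolymorphic\n"
)
exon_records = read_gene_calls(exon_example)
full_records = read_gene_calls(full_example)
```

The same gene can receive different classifications under the two approaches, so it can be useful to load both and compare them by id_name.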

(5) Sliding window normalized SNP rate values derived from pairwise alignments. From each pairwise alignment, the number of SNPs and amount of alignable sequence could be determined. We used a sliding window approach to count the number of SNPs and base pairs of alignable sequences in 1Mb windows offset by 250kb from the start to end of each chromosome. Normalised SNP counts for each 1Mb window were then determined by dividing the SNP count by the total amount of alignable sequence in the 1Mb window. These sliding window normalized SNP counts can be found in SNP_density_summaries.tar.gz. There are 25 files within here that calculate these normalized SNP counts from each pairwise alignment. Column structure is the same across all 25 files.

  • Column 1 - chr - chromosome position of sliding window
  • Column 2 - BinStart - start position in B73 of sliding window
  • Column 3 - BinEnd - end position in B73 of sliding window
  • Column 4 - SNP_count - number of identified SNPs in the sliding window
  • Column 5 - nonSV_bp - number of non-structural variant base pairs in B73 for that region
  • Column 6 - norm_SNP_count - normalized SNP count for that region calculated by taking the raw SNP count and dividing by the amount of nonSV sequence in that window.
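The windowing and normalization described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used to produce the files: coordinates are treated as 0-based and half-open, windows that run past the chromosome end are truncated rather than dropped, and windows with no non-SV sequence return None. None of these conventions is confirmed by the description above, and the function names are hypothetical.

```python
def sliding_windows(chrom_length, window=1_000_000, step=250_000):
    """Yield (start, end) for 1 Mb windows offset by 250 kb along a chromosome.

    Assumption: 0-based half-open coordinates, with the final windows
    truncated at the chromosome end.
    """
    start = 0
    while start < chrom_length:
        yield start, min(start + window, chrom_length)
        start += step


def normalized_snp_count(snp_count, nonsv_bp):
    """Divide the raw SNP count by the amount of non-SV sequence in the window.

    Returns None when the window has no non-SV sequence (an assumption about
    how such windows should be handled).
    """
    if nonsv_bp == 0:
        return None
    return snp_count / nonsv_bp
```

For example, a window with 50 SNPs over 800,000 bp of non-SV sequence has a normalized SNP count of 50 / 800000.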

(6) Structural Variant Category and TE Relationship. We classified every structural variant (SV) depending on how TE annotations overlapped the SV. SVs could be categorized as either: "No TE SV", "Incomplete TE SV", "TE = SV", "Multi TE SV", or "TE Within SV". We include two data files, B73_Structural_Variant_Category_Info.tsv and NAM_Structural_Variant_Category_Info.tsv, which contain the category information for either all of the SVs called with sequence in B73 relative to another NAM line or all of the SVs called with sequence in another NAM line relative to B73, respectively. For any SV that overlaps a TE, the TE names are listed. This can be linked to the TE_name in the polymorphic_TE_calls to obtain details on the annotation. Column structure is as follows:

  • Column 1 - AW_BlockID - The unique region identifier for this AnchorWave Structural Variant
  • Column 2 - Lineage_Comp - The NAM line comparison from which this SV was called.
  • Column 3 - chr - The chromosome where the SV was located
  • Column 4 - start - The start position of the SV
  • Column 5 - end - The end position of the SV
  • Column 6 - Category - Which category this SV falls into: "No TE SV", "Incomplete TE SV", "TE = SV", "Multi TE SV", or "TE Within SV".
  • Column 7 - Overlapping_TEs - A list of the TEs that overlap the SV
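Linking an SV to the TE annotations it overlaps amounts to a lookup on TE_name. The sketch below is a minimal Python illustration with hypothetical helper names. The separator used inside the Overlapping_TEs column is not stated above, so it is exposed as a parameter with "," as an assumed default; check a few rows of the real files before relying on it.

```python
def index_te_calls(te_rows):
    """Build a TE_name -> row lookup from polymorphic TE call rows (dicts)."""
    return {row["TE_name"]: row for row in te_rows}


def tes_for_sv(sv_row, te_index, sep=","):
    """Return the TE call rows named in an SV's Overlapping_TEs field.

    'sep' is an assumed list separator. Names that are missing from the
    index, including any placeholder used for SVs without TEs, are skipped.
    """
    names = [n.strip() for n in sv_row["Overlapping_TEs"].split(sep) if n.strip()]
    return [te_index[n] for n in names if n in te_index]
```

A usage example with made-up identifiers: build the index from the rows of the matching TE classification file, then call tes_for_sv on each SV row whose Category is not "No TE SV".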

Sharing/Access information

Please contact mmunasin@umn.edu if you have any difficulties accessing the data or if you have any questions. We are very happy to assist folks in using this data. Publicly available datasets used in this analysis were found on MaizeGDB. Genome assemblies were found at https://www.maizegdb.org/genome. TE and gene annotations were downloaded from https://maizegdb.org/NAM_project, while synteny classifications for the NAM genes were downloaded from https://ars-usda.app.box.com/v/maizegdb-public/folder/186350887665.

Code/Software

Scripts used to filter and analyze data are available on GitHub at https://github.com/mam737/PolymorphicTEs_NAM. To visualize pairwise alignments with overlapping TE and gene annotations, an R Shiny App Web Browser was developed and can be found at https://mmunasin.shinyapps.io/nam_sv/.

Methods

High-quality genome assemblies for the 26 Nested Association Mapping (NAM) inbred founder lines were downloaded from MaizeGDB (https://www.maizegdb.org/genome). AnchorWave v1.0.1 was used to perform pairwise whole genome alignments to compare each of the NAM inbred genomes to the B73 reference genome (included in the NAM founder line set) for a total of 25 pairwise whole-genome alignments via the 'genoAli' command and '-IV' parameter. The MAFToGVCF plugin of tassel v5.2.82 was used to reformat genome alignments in MAF format into variant calling records in GVCF format. Both the MAF and GVCF formats are provided here.

TE annotations, gene annotations, and gene synteny calls were downloaded from MaizeGDB. TE and gene annotations were downloaded from https://maizegdb.org/NAM_project, while synteny classifications for the NAM genes were downloaded from https://ars-usda.app.box.com/v/maizegdb-public/folder/186350887665.

Scripts used to filter publicly available datasets and to generate new data can be found on GitHub at https://github.com/mam737/PolymorphicTEs_NAM along with a README detailing what each script does. 

Funding

National Science Foundation, Award: IOS-2010908

National Science Foundation, Award: IOS-2109697

National Science Foundation, Award: IOS-1907343

National Science Foundation, Award: IOS-1934384