Combined analysis of transposable elements and structural variation in maize genomes reveals genome contraction outpaces expansion

Munasinghe, Manisha1; Read, Andrew1; Stitzer, Michelle2; Song, Baoxing3; Menard, Claire1; Ma, Kristy Yubo4; Brandvain, Yaniv1; Hirsch, Candice1; Springer, Nathan1

Published Mar 02, 2023; Updated Nov 20, 2023 on Dryad. https://doi.org/10.5061/dryad.5qfttdz9t

Data files

Mar 02, 2023 version files 55.83 GB

NAM_AnchorWave_Alignments_GVCF.tar.gz - 25.49 GB
NAM_AnchorWave_Alignments_MAF.tar.gz - 29.30 GB
polymorphic_gene_calls.tar.gz - 87.85 MB
polymorphic_TE_calls.tar.gz - 955.93 MB
README.md - 10.41 KB
SNP_density_summaries.tar.gz - 4.36 MB

Oct 03, 2023 version files 55.99 GB

NAM_AnchorWave_Alignments_GVCF.tar.gz - 25.49 GB
NAM_AnchorWave_Alignments_MAF.tar.gz - 29.30 GB
polymorphic_gene_calls.tar.gz - 87.85 MB
polymorphic_TE_calls.tar.gz - 955.93 MB
README.md - 11.94 KB
SNP_density_summaries.tar.gz - 4.36 MB
summarised_AnchorWave_Regions.tar.gz - 154.81 MB

Nov 20, 2023 version files 56.10 GB

B73_Structural_Variant_Category_Info.tsv.tar.gz - 54.21 MB
NAM_AnchorWave_Alignments_GVCF.tar.gz - 25.49 GB
NAM_AnchorWave_Alignments_MAF.tar.gz - 29.30 GB
NAM_Structural_Variant_Category_Info.tsv.tar.gz - 54.40 MB
polymorphic_gene_calls.tar.gz - 87.85 MB
polymorphic_TE_calls.tar.gz - 955.93 MB
README.md - 13.26 KB
SNP_density_summaries.tar.gz - 4.36 MB
summarised_AnchorWave_Regions.tar.gz - 154.81 MB

Abstract

Background – Structural differences between genomes are a major source of genetic variation that contributes to phenotypic differences. Transposable elements, mobile genetic sequences capable of increasing their copy number and propagating themselves within genomes, can generate structural variation. However, their repetitive nature makes it difficult to characterize fine-scale differences in their presence at specific positions, limiting our understanding of their impact on genome variation. Domesticated maize is a particularly good system for exploring the impact of transposable element proliferation as over 70% of the genome is annotated as transposable elements. High-quality transposable element annotations were recently generated for de-novo genome assemblies of 26 diverse inbred maize lines.

Results – We generated base-pair resolved pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. From this data, we classified transposable elements as either shared or polymorphic in a given pairwise comparison. Our analysis uncovered substantial structural variation between lines, representing both putative insertion and deletion events. Putative insertions in SNP-depleted regions, which represent recently diverged identity by state blocks, suggest some TE families may still be active. However, our analysis reveals that genome-wide, deletions of transposable elements account for more structural variation than insertions. These deletions are often large structural variants containing multiple transposable elements.

Conclusions – Combined, our results highlight how transposable elements contribute to structural variation and demonstrate that deletion events are a major contributor to genomic differences.

Structural differences between genomes are a major source of genetic variation. Transposable elements, or TEs, are mobile genetic sequences capable of increasing their copy number and propagating themselves within genomes. TEs can generate structural variation as a result of their transposition. Advancements in long-read sequencing technologies and computation methods have enhanced our ability to characterize and investigate structural variation between genomes. Detailed characterization of multiple haplotypes for several loci in domestic maize (Zea mays ssp. mays) revealed extensive structural polymorphisms for TE content. Given the high TE content of the maize genome, it is likely that TEs are a major contributor to structural variants (SVs), but this has yet to be fully quantified.

High-quality TE annotations were recently generated for de novo genome assemblies of 26 diverse inbred maize lines used to generate the nested association mapping (NAM) population. Using these genome assemblies, we generated base-pair resolved pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. From these data, we classified features (either TEs or genes) as either shared or polymorphic in a given pairwise comparison. This data repository contains several relevant datasets both generated and used in our analysis. We are sharing these data publicly for use by any interested parties. Scripts used in this analysis can be found separately at https://github.com/mam737/PolymorphicTEs_NAM.

Description of the data and file structure

Data included:

(1) Pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. These data were generated using AnchorWave (v1.0.1). The MAFToGVCF plugin of TASSEL v5.2.82 was used to reformat the genome alignments output by AnchorWave in MAF format into variant calling records in GVCF format. Filenames for these datasets have the following structure - NAM_Line_Comparison.file_format.gz. So, B97.maf.gz represents the MAF output for the pairwise alignment between the B73 reference genome and the B97 inbred maize line. There are 25 maf.gz and 25 gvcf.gz files. Introductions to MAF file formats can be found here - https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/, and introductions to GVCF file formats can be found here - https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format.

(2) Summarised AnchorWave Regions - GVCFs generated from the AnchorWave alignments were subsequently parsed to break each alignment down into 'alignable', 'structural variant', or 'unalignable' sequences. Given our interest in shared or polymorphic TEs, we condensed the GVCF output by combining the nonvariant sites, single nucleotide variants, and small (<50bp) insertion/deletion events into a single class of 'alignable' regions. 'Structural variants' in one genotype relative to another were defined as regions >50bp for which the size in the other genotype is 0bp. The remaining variants all include at least one base pair in each genotype that is not fully aligned, and these regions were consequently classified as 'unalignable'. 'Unalignable' regions also include gaps in the AnchorWave alignment. Within the archive summarised_AnchorWave_Regions.tar.gz, there are 25 folders, one for each of the pairwise AnchorWave alignments. In each folder, there are 10 files, one for each chromosome of the alignment. Column structure is as follows:

Column 1 - B73 Chromosome
Column 2 - B73 Start Position
Column 3 - B73 End Position
Column 4 - NAM Chromosome
Column 5 - NAM Start Position
Column 6 - NAM End Position
Column 7 - Region Type (either 'alignable_region', 'structural_insertion_inB73', 'structural_insertion_inNAM', or 'unalignable')
Column 8 - Unique Region Identifier for this AnchorWave Alignment
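
As a rough illustration of how the eight columns above might be used, the following sketch sums the B73 and NAM sequence span of each region type. It assumes the per-chromosome files are tab-separated, that a span can be computed as end minus start, and that a header line (if present) can be skipped; none of this is confirmed here, and the column labels and toy rows are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Our own labels for the eight columns described above (not taken from the files).
REGION_COLUMNS = [
    "b73_chr", "b73_start", "b73_end",
    "nam_chr", "nam_start", "nam_end",
    "region_type", "region_id",
]

def span_by_region_type(handle):
    """Return {region_type: (b73_bp, nam_bp)} summed over all rows."""
    totals = defaultdict(lambda: [0, 0])
    reader = csv.DictReader(handle, fieldnames=REGION_COLUMNS, delimiter="\t")
    for row in reader:
        if not (row["b73_start"] or "").isdigit():
            continue  # skip a header line, if the file has one
        # ASSUMPTION: span = end - start; check the coordinate convention first.
        totals[row["region_type"]][0] += int(row["b73_end"]) - int(row["b73_start"])
        totals[row["region_type"]][1] += int(row["nam_end"]) - int(row["nam_start"])
    return {region: tuple(bp) for region, bp in totals.items()}

# Made-up rows for illustration only; real files come from the archive.
toy = io.StringIO(
    "chr1\t0\t1000\tchr1\t0\t1000\talignable_region\tr1\n"
    "chr1\t1000\t1800\tchr1\t1000\t1000\tstructural_insertion_inB73\tr2\n"
    "chr1\t1800\t1800\tchr1\t1000\t1300\tstructural_insertion_inNAM\tr3\n"
    "chr1\t1800\t2500\tchr1\t1300\t1900\tunalignable\tr4\n"
)
totals = span_by_region_type(toy)
for region_type, (b73_bp, nam_bp) in totals.items():
    print(region_type, b73_bp, nam_bp)
```

To run this on a real file, replace the toy handle with an opened per-chromosome file from one of the 25 folders.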

(3) Polymorphic TE Calls. By intersecting publicly available TE annotations generated using EDTA with our pairwise alignments, we could classify any TE annotation as either shared or polymorphic in the given comparison. The pairwise nature of these alignments means that the B73 TE annotations would be classified 25 times, once relative to each NAM line, while each NAM line TE annotation would be classified once relative to B73. polymorphic_TE_calls.tar.gz contains all TE annotation classifications. Within this folder, there are two subfolders - B73 and NAM. B73 houses the 25 B73 TE annotations, while NAM houses the 25 NAM TE annotations. Within each folder, the file name structure is as follows B73_NAM_TE_classification_by_n.tsv where NAM indicates which inbred line is being compared. So, the file ~/polymorphic_TE_calls/B73/B73_B97_TE_classification_by_n.tsv classifies every B73 TE as either shared, polymorphic, or ambiguous (could not be classified as either shared or polymorphic) relative to the B73 versus B97 genome pairwise alignment. Column structure is the same regardless of whether you are looking at the B73 classifications or the compared NAM classifications.

Column 1 - TE_name - unique identifier for that TE annotation
Column 2 - chr - chromosomal location of that TE annotation
Column 3 - start - start base pair position of that TE annotation
Column 4 - end - end base pair position of that TE annotation
Column 5 - method - indicates whether annotation was identified using a structural approach (structural) or homology-based approach (homology)
Column 6 - type - denotes the identified type of TE annotation
Column 7 - class - denotes whether the TE annotation is a Class I or Class II TE
Column 8 - raw_superfamily - denotes the superfamily assigned by the EDTA TE annotation software
Column 9 - upd_superfamily - raw_superfamily names were transformed into shortened codes to provide updated superfamily names
Column 10 - condense_superfamily - further condensed superfamily names
Column 11 - raw_family - family name associated with TE annotation
Column 12 - alignable_region - proportion of TE annotation that overlaps alignable regions in pairwise alignment
Column 13 - structural_insertion_inB73 - proportion of TE annotation that overlaps structural variants with sequence in B73 (this column is entirely 0 for NAM TE annotations)
Column 15 - structural_insertion_inNAM - proportion of TE annotation that overlaps structural variants with sequence in NAM (this column is entirely 0 for B73 TE annotations)
Column 16 - unalignable - proportion of TE annotation that overlaps unalignable sequence identified from pairwise alignment
Column 17 - Missing_Data - proportion of TE annotation that overlaps gaps in the pairwise alignment
Column 18 - AW_Blocks - Unique identifier of AnchorWave blocks that TE annotation overlaps
Column 19 - classification - final classification group assigned (shared, polymorphic, or ambiguous)
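
A minimal sketch of tallying TE calls by superfamily and classification is given below. It assumes each .tsv file has a header row carrying the column names listed above, and it reads columns by name rather than by position. The toy rows, TE names, and superfamily codes are invented for illustration.

```python
import csv
import io
from collections import Counter

def tally_te_calls(handle):
    """Count TE annotations per (condense_superfamily, classification) pair."""
    counts = Counter()
    # ASSUMPTION: tab-separated with a header row naming the columns.
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[(row["condense_superfamily"], row["classification"])] += 1
    return counts

# Made-up subset of columns and values, for illustration only.
toy_te = io.StringIO(
    "TE_name\tcondense_superfamily\tclassification\n"
    "TE_1\tLTR\tshared\n"
    "TE_2\tLTR\tpolymorphic\n"
    "TE_3\tTIR\tambiguous\n"
    "TE_4\tLTR\tpolymorphic\n"
)
te_counts = tally_te_calls(toy_te)
for (superfamily, call), n in sorted(te_counts.items()):
    print(superfamily, call, n)
```

Reading by column name keeps the sketch independent of the exact column positions in the files.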

(4) Polymorphic Gene Calls. By intersecting publicly available gene annotations with our pairwise alignments, we could classify any gene annotation as either shared or polymorphic in the given comparison. The pairwise nature of these alignments means that the B73 gene annotations would be classified 25 times, once relative to each NAM line, while each NAM line gene annotation would be classified once relative to B73. polymorphic_gene_calls.tar.gz contains all gene annotation classifications. Within this folder, there are two subfolders - B73 and NAM. B73 houses the 25 B73 gene annotations, while NAM houses the 25 NAM gene annotations. Genes could be classified using either an exon-only or full-length approach, and separate folders are used to house either the by_exon or by_full length analysis. Within each folder, the file name structure is as follows B73_NAM_gene_classification_by_type.tsv where NAM indicates which inbred line is being compared and type indicates whether it is an exon-only or full-length call. So, the file ~/polymorphic_gene_calls/B73/exon_calls/B73_B97_gene_classification_by_exon.tsv classifies every B73 gene as either shared, polymorphic, or ambiguous (could not be classified as either shared or polymorphic) relative to the B73 versus B97 genome pairwise alignment taking an exon-only gene approach. Column structure is the same regardless of whether you are looking at the B73 classifications or the compared NAM classifications.

Column 1 - id_name - unique identifier for that gene annotation
Column 2 - chr - chromosomal location of that gene annotation
Column 3 - start - start base pair position of that gene annotation
Column 4 - end - end base pair position of that gene annotation
Column 5 - exon_num - number of exons for that gene (This column is ONLY present for exon-only approach)
Column 6 - alignable_region - proportion of gene annotation that overlaps alignable regions in pairwise alignment
Column 7 - structural_insertion_inB73 - proportion of gene annotation that overlaps structural variants with sequence in B73 (this column is entirely 0 for NAM gene annotations)
Column 8 - structural_insertion_inNAM - proportion of gene annotation that overlaps structural variants with sequence in NAM (this column is entirely 0 for B73 gene annotations)
Column 9 - unalignable - proportion of gene annotation that overlaps unalignable sequence identified from pairwise alignment
Column 10 - Missing_Data - proportion of gene annotation that overlaps gaps in the pairwise alignment
Column 11 - AW_Blocks - Unique identifier of AnchorWave blocks that gene annotation overlaps
Column 12 - classification - final classification group assigned (shared, polymorphic, or ambiguous)
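
Because each B73 gene is classified once per NAM comparison, one natural summary is the number of comparisons in which a gene is called polymorphic. The sketch below shows one way this could be computed. It assumes a header row with the column names above, and reads columns by name so that the exon_num column (present only in the exon-only files) does not shift anything. The gene identifiers and rows are invented.

```python
import csv
import io
from collections import Counter

def polymorphic_comparisons(handles_by_line):
    """For each gene, count the comparisons in which it is called polymorphic."""
    counts = Counter()
    for handle in handles_by_line.values():
        # ASSUMPTION: tab-separated with a header row naming the columns.
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["classification"] == "polymorphic":
                counts[row["id_name"]] += 1
    return counts

# Made-up calls for two comparisons, for illustration only.
toy_calls = {
    "B97": io.StringIO(
        "id_name\texon_num\tclassification\n"
        "gene_A\t3\tpolymorphic\n"
        "gene_B\t1\tshared\n"
        "gene_C\t5\tshared\n"
    ),
    "Ky21": io.StringIO(
        "id_name\texon_num\tclassification\n"
        "gene_A\t3\tpolymorphic\n"
        "gene_B\t1\tpolymorphic\n"
        "gene_C\t5\tambiguous\n"
    ),
}
gene_counts = polymorphic_comparisons(toy_calls)
print(gene_counts.most_common())
```

With the real data, the dictionary would hold one opened file per NAM line from the B73 subfolder.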

(5) Sliding window normalized SNP rate values derived from pairwise alignments. From each pairwise alignment, the number of SNPs and amount of alignable sequence could be determined. We used a sliding window approach to count the number of SNPs and base pairs of alignable sequence in 1Mb windows offset by 250kb from the start to end of each chromosome. Normalized SNP counts for each 1Mb window were then determined by dividing the SNP count by the total amount of alignable sequence in the 1Mb window. These sliding window normalized SNP counts can be found in SNP_density_summaries.tar.gz. This archive contains 25 files, one per pairwise alignment, each holding the normalized SNP counts for that alignment. Column structure is the same across all 25 files.

Column 1 - chr - chromosome position of sliding window
Column 2 - BinStart - start position in B73 of sliding window
Column 3 - BinEnd - end position in B73 of sliding window
Column 4 - SNP_count - number of identified SNPs in the sliding window
Column 5 - nonSV_bp - number of non-structural variant base pairs in B73 for that region
Column 6 - norm_SNP_count - normalized SNP count for that region calculated by taking the raw SNP count and dividing by the amount of nonSV sequence in that window.
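
The windowing and normalization described above can be sketched as follows. The 1Mb window and 250kb offset come from the description; the zero-based half-open coordinates, the truncation of the final windows at the chromosome end, and the handling of windows with no nonSV sequence are assumptions made for this example and may differ from the original scripts.

```python
import math

def sliding_windows(chrom_length, window=1_000_000, step=250_000):
    """Yield (start, end) windows of `window` bp, offset by `step` bp."""
    start = 0
    while start < chrom_length:
        # ASSUMPTION: windows running past the chromosome end are truncated.
        yield start, min(start + window, chrom_length)
        start += step

def norm_snp_count(snp_count, nonsv_bp):
    """SNP count divided by the nonSV base pairs in the window."""
    if nonsv_bp == 0:
        return math.nan  # ASSUMPTION: undefined when nothing is alignable
    return snp_count / nonsv_bp

# Hypothetical 1.6 Mb chromosome, for illustration only.
windows = list(sliding_windows(1_600_000))
print(len(windows), windows[0], windows[-1])
print(norm_snp_count(500, 800_000))
```

A window with 500 SNPs over 800,000 nonSV base pairs therefore has a normalized count of 500 / 800,000 = 0.000625.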

(6) Structural Variant Category and TE Relationship. We classified every structural variant (SV) depending on how TE annotations overlapped the SV. SVs could be categorized as either: "No TE SV", "Incomplete TE SV", "TE = SV", "Multi TE SV", or "TE Within SV". We include two data files, B73_Structural_Variant_Category_Info.tsv and NAM_Structural_Variant_Category_Info.tsv, which contain the category information for all of the SVs called with sequence in B73 relative to another NAM line and for all of the SVs called with sequence in another NAM line relative to B73, respectively. For any SV that overlaps a TE, the TE names are listed. These names can be linked to the TE_name column in the polymorphic_TE_calls files to obtain details on the annotation. Column structure is as follows:

Column 1 - AW_BlockID - The unique region identifier for this AnchorWave Structural Variant
Column 2 - Lineage_Comp - The NAM line comparison from which this SV was called.
Column 3 - chr - The chromosome where the SV was located
Column 4 - start - The start position of the SV
Column 5 - end - The end position of the SV
Column 6 - Category - Which category this SV falls into: "No TE SV", "Incomplete TE SV", "TE = SV", "Multi TE SV", or "TE Within SV".
Column 7 - Overlapping_TEs - A list of the TEs that overlap the SV
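
The link between Overlapping_TEs and TE_name could be followed along the lines of the sketch below. The delimiter used inside Overlapping_TEs is not stated above, so a comma is assumed here, as is the treatment of empty or "NA" entries for SVs without TEs. Both files are assumed to have header rows, and the toy identifiers are invented.

```python
import csv
import io

def load_te_superfamilies(te_handle):
    """Map TE_name -> condense_superfamily from a TE calls file."""
    return {
        row["TE_name"]: row["condense_superfamily"]
        for row in csv.DictReader(te_handle, delimiter="\t")
    }

def superfamilies_per_sv(sv_handle, te_superfamily, sep=","):
    """Map AW_BlockID -> superfamilies of the TEs listed for that SV."""
    result = {}
    for row in csv.DictReader(sv_handle, delimiter="\t"):
        listed = (row["Overlapping_TEs"] or "").strip()
        # ASSUMPTION: `sep`-separated names; empty or "NA" means no TE.
        names = [] if listed in ("", "NA") else [n.strip() for n in listed.split(sep)]
        result[row["AW_BlockID"]] = [te_superfamily.get(n, "unknown") for n in names]
    return result

# Made-up rows for illustration only.
toy_tes = io.StringIO(
    "TE_name\tcondense_superfamily\tclassification\n"
    "TE_1\tLTR\tpolymorphic\n"
    "TE_2\tTIR\tpolymorphic\n"
)
toy_svs = io.StringIO(
    "AW_BlockID\tLineage_Comp\tchr\tstart\tend\tCategory\tOverlapping_TEs\n"
    "blk_1\tB97\tchr1\t100\t900\tMulti TE SV\tTE_1,TE_2\n"
    "blk_2\tB97\tchr1\t2000\t2100\tNo TE SV\t\n"
)
sv_superfamilies = superfamilies_per_sv(toy_svs, load_te_superfamilies(toy_tes))
print(sv_superfamilies)
```

When working with the real files, it is worth inspecting a few Overlapping_TEs values first and adjusting `sep` accordingly.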

Sharing/Access information

Please contact mmunasin@umn.edu if you have any difficulties accessing the data or if you have any questions. We are very happy to assist anyone using these data. Publicly available datasets used in this analysis were found on MaizeGDB. Genome assemblies were found at https://www.maizegdb.org/genome. TE and gene annotations were downloaded from https://maizegdb.org/NAM_project, while synteny classifications for the NAM genes were downloaded from https://ars-usda.app.box.com/v/maizegdb-public/folder/186350887665.

Code/Software

Scripts used to filter and analyze data are available on GitHub at https://github.com/mam737/PolymorphicTEs_NAM. To visualize pairwise alignments with overlapping TE and gene annotations, an R Shiny web app was developed and can be found at https://mmunasin.shinyapps.io/nam_sv/.
