# Combined analysis of transposable elements and structural variation in maize genomes reveals genome contraction outpaces expansion

---

Structural differences between genomes are a major source of genetic variation. Transposable elements, or TEs, are mobile genetic sequences capable of increasing their copy number and propagating themselves within genomes. TEs can generate structural variation as a result of their transposition. Advancements in long-read sequencing technologies and computational methods have enhanced our ability to characterize and investigate structural variation between genomes. Detailed characterization of multiple haplotypes for several loci in domestic maize (Zea mays ssp. mays) revealed extensive structural polymorphisms for TE content. Given the high TE content of the maize genome, it is likely that TEs are a major contributor to structural variants (SVs), but this has yet to be fully quantified. High-quality TE annotations were recently generated for de novo genome assemblies of 26 diverse inbred maize lines used to generate the nested association mapping (NAM) population. Using these genome assemblies, we generated base-pair resolved pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. From these alignments, we classified features (either TEs or genes) as either shared or polymorphic in a given pairwise comparison.

This data repository contains several relevant datasets both generated and used in our analysis. We are sharing this data here publicly for use by any interested parties. Scripts used in this analysis can be found separately at https://github.com/mam737/PolymorphicTEs_NAM.

## Description of the data and file structure

Data included:

(1) Pairwise alignments between the B73 maize reference genome and the remaining 25 inbred maize line assemblies. These alignments were generated using AnchorWave (v1.0.1).
The MAFToGVCF plugin of TASSEL v5.2.82 was used to reformat the genome alignments output by AnchorWave in MAF format into variant calling records in GVCF format. Filenames for these datasets have the following structure: NAM_Line_Comparison.file_format.gz. So, B97.maf.gz represents the MAF output for the pairwise alignment between the B73 reference genome and the B97 inbred maize line. There are 25 maf.gz and 25 gvcf.gz files. An introduction to the MAF file format can be found at https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/, and an introduction to the GVCF file format can be found at https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format.

(2) Polymorphic TE Calls. By intersecting publicly available TE annotations generated using EDTA with our pairwise alignments, we could classify any TE annotation as either shared or polymorphic in the given comparison. The pairwise nature of these alignments means that the B73 TE annotations are classified 25 times, once relative to each NAM line, while each NAM line TE annotation is classified once relative to B73.

polymorphic_TE_calls.tar.gz contains all TE annotation classifications. Within this folder, there are two subfolders - B73 and NAM. B73 houses the 25 B73 TE annotations, while NAM houses the 25 NAM TE annotations. Within each folder, the file name structure is as follows: B73_NAM_TE_classification_by_n.tsv, where NAM indicates which inbred line is being compared. So, the file ~/polymorphic_TE_calls/B73/B73_B97_TE_classification_by_n.tsv classifies every B73 TE as either shared, polymorphic, or ambiguous (could not be classified as either shared or polymorphic) relative to the B73 versus B97 genome pairwise alignment.

Column structure is the same regardless of whether you are looking at the B73 classifications or the compared NAM classifications.
Column 1 - TE_name - unique identifier for that TE annotation
Column 2 - chr - chromosomal location of that TE annotation
Column 3 - start - start base pair position of that TE annotation
Column 4 - end - end base pair position of that TE annotation
Column 5 - method - indicates whether the annotation was identified using a structural approach (structural) or a homology-based approach (homology)
Column 6 - type - denotes the identified type of TE annotation
Column 7 - class - denotes whether the TE annotation is a Class I or Class II TE
Column 8 - raw_superfamily - denotes the superfamily assigned by the EDTA TE annotation software
Column 9 - upd_superfamily - updated superfamily names, in which raw_superfamily names were transformed into shortened codes
Column 10 - condense_superfamily - further condensed superfamily names
Column 11 - raw_family - family name associated with the TE annotation
Column 12 - alignable_region - proportion of the TE annotation that overlaps alignable regions in the pairwise alignment
Column 13 - structural_insertion_inB73 - proportion of the TE annotation that overlaps structural variants with sequence in B73 (this column is entirely 0 for NAM TE annotations)
Column 15 - structural_insertion_inNAM - proportion of the TE annotation that overlaps structural variants with sequence in NAM (this column is entirely 0 for B73 TE annotations)
Column 16 - unalignable - proportion of the TE annotation that overlaps unalignable sequence identified from the pairwise alignment
Column 17 - Missing_Data - proportion of the TE annotation that overlaps gaps in the pairwise alignment
Column 18 - AW_Blocks - unique identifier of the AnchorWave blocks that the TE annotation overlaps
Column 19 - classification - final classification group assigned (shared, polymorphic, or ambiguous)

(3) Polymorphic Gene Calls. By intersecting publicly available gene annotations with our pairwise alignments, we could classify any gene annotation as either shared or polymorphic in the given comparison.
The pairwise nature of these alignments means that the B73 gene annotations are classified 25 times, once relative to each NAM line, while each NAM line gene annotation is classified once relative to B73.

polymorphic_gene_calls.tar.gz contains all gene annotation classifications. Within this folder, there are two subfolders - B73 and NAM. B73 houses the 25 B73 gene annotations, while NAM houses the 25 NAM gene annotations. Genes could be classified using either an exon-only or full-length approach, and separate folders are used to house either the by_exon or by_full length analysis. Within each folder, the file name structure is as follows: B73_NAM_gene_classification_by_type.tsv, where NAM indicates which inbred line is being compared and type indicates whether it is an exon-only or full-length call. So, the file ~/polymorphic_gene_calls/B73/exon_calls/B73_B97_gene_classification_by_exon.tsv classifies every B73 gene as either shared, polymorphic, or ambiguous (could not be classified as either shared or polymorphic) relative to the B73 versus B97 genome pairwise alignment, taking an exon-only gene approach.

Column structure is the same regardless of whether you are looking at the B73 classifications or the compared NAM classifications.
Column 1 - id_name - unique identifier for that gene annotation
Column 2 - chr - chromosomal location of that gene annotation
Column 3 - start - start base pair position of that gene annotation
Column 4 - end - end base pair position of that gene annotation
Column 5 - exon_num - number of exons for that gene (this column is ONLY present for the exon-only approach)
Column 6 - alignable_region - proportion of the gene annotation that overlaps alignable regions in the pairwise alignment
Column 7 - structural_insertion_inB73 - proportion of the gene annotation that overlaps structural variants with sequence in B73 (this column is entirely 0 for NAM gene annotations)
Column 8 - structural_insertion_inNAM - proportion of the gene annotation that overlaps structural variants with sequence in NAM (this column is entirely 0 for B73 gene annotations)
Column 9 - unalignable - proportion of the gene annotation that overlaps unalignable sequence identified from the pairwise alignment
Column 10 - Missing_Data - proportion of the gene annotation that overlaps gaps in the pairwise alignment
Column 11 - AW_Blocks - unique identifier of the AnchorWave blocks that the gene annotation overlaps
Column 12 - classification - final classification group assigned (shared, polymorphic, or ambiguous)

(4) Sliding window normalized SNP rate values derived from pairwise alignments. From each pairwise alignment, the number of SNPs and the amount of alignable sequence could be determined. We used a sliding window approach to count the number of SNPs and base pairs of alignable sequence in 1 Mb windows offset by 250 kb from the start to the end of each chromosome. Normalized SNP counts for each 1 Mb window were then determined by dividing the SNP count by the total amount of alignable sequence in that window.

These sliding window normalized SNP counts can be found in SNP_density_summaries.tar.gz, which contains 25 files, one for each pairwise alignment. Column structure is the same across all 25 files.
Column 1 - chr - chromosome on which the sliding window is located
Column 2 - BinStart - start position in B73 of the sliding window
Column 3 - BinEnd - end position in B73 of the sliding window
Column 4 - SNP_count - number of identified SNPs in the sliding window
Column 5 - nonSV_bp - number of non-structural variant base pairs in B73 for that region
Column 6 - norm_SNP_count - normalized SNP count for that region, calculated by dividing the raw SNP count by the amount of nonSV sequence in that window

## Sharing/Access information

Please contact mmunasin@umn.edu if you have any difficulties accessing the data or if you have any questions. We are very happy to assist folks in using this data.

Publicly available datasets used in this analysis were found on MaizeGDB. Genome assemblies were found at https://www.maizegdb.org/genome. TE and gene annotations were downloaded from https://maizegdb.org/NAM_project, while synteny classifications for the NAM genes were downloaded from https://ars-usda.app.box.com/v/maizegdb-public/folder/186350887665.

## Code/Software

Scripts used to filter and analyze data are available on GitHub at https://github.com/mam737/PolymorphicTEs_NAM.

To visualize pairwise alignments with overlapping TE and gene annotations, an R Shiny App Web Browser was developed and can be found at https://mmunasin.shinyapps.io/nam_sv/.
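
For readers who want a quick starting point, the following is a minimal Python sketch of how the classification tables and the sliding window normalization described above might be handled. It is not taken from the analysis scripts in the GitHub repository. The function names are hypothetical, and the sketch assumes that each .tsv file has a header row using the column names listed above, that window coordinates are 0-based and half-open, and that the final windows on a chromosome are truncated at the chromosome end. None of these details are confirmed by the dataset description.

```python
import csv
from collections import Counter


def count_classifications(tsv_handle):
    """Tally the `classification` column (shared, polymorphic, ambiguous).

    Assumes a tab-separated file with a header row that includes a
    `classification` column, as in the column lists above.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(row["classification"] for row in reader)


def sliding_windows(chrom_length, size=1_000_000, step=250_000):
    """Yield (start, end) windows of `size` bp offset by `step` bp.

    Coordinates are 0-based and half-open (an assumption); windows that
    would run past the chromosome end are truncated (also an assumption).
    """
    start = 0
    while start < chrom_length:
        yield start, min(start + size, chrom_length)
        start += step


def normalized_snp_count(snp_count, nonsv_bp):
    """Divide the SNP count by the alignable (non-SV) base pairs in a window.

    Returns None when a window has no alignable sequence, to avoid
    dividing by zero.
    """
    if nonsv_bp == 0:
        return None
    return snp_count / nonsv_bp
```

As a usage example, `count_classifications(open(path, newline=""))` could be called with `path` pointing at an extracted copy of B73_B97_TE_classification_by_n.tsv to count how many B73 TEs fall into each classification group relative to B97.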