Data from: Structural variation and its potential impact on genome instability: novel discoveries in the EGFR landscape by long-read sequencing

Cite this dataset

Emerson, Lyska L.; Cook, George W. (2021). Data from: Structural variation and its potential impact on genome instability: novel discoveries in the EGFR landscape by long-read sequencing [Dataset]. Dryad. https://doi.org/10.5061/dryad.n7kp793

Abstract

Studies of structural variation (SV) have been challenging due to technological constraints. The advent of third-generation (long-read) sequencing technology has made it possible to explore longer stretches of DNA that were not easily examined previously. In the present study, we used long-read sequencing to examine SV in the EGFR landscape of four haplotypes derived from two human samples. We analyzed the EGFR gene and its landscape (± 500,000 base pairs) using this sequencing approach and identified regions of non-coding DNA that had relatively high similarity to the most common activating EGFR mutation in non-small cell lung cancer. We discovered that reverse complements of the exon 19 deletion mutation that had at least 60% homology to the canonical EGFR exon 19 deletion and lay within ± 421,000 bp of the deletion varied across the five haploid genomes examined (4 patient landscapes and hg38). Although the sample size in this study is limited, the estimated variation in genomic stability observed among the five EGFR haplotypes examined is novel and encourages further work to examine structural variation in larger cohorts.

Methods

Long DNA fragments were obtained from each sample by targeted capture and submitted for long-read sequencing (Pacific Biosciences, SMRT® Sequencing with Sequel Binding Kit 2.0, Sequel Sequencing Plate 2.1).

After primary analysis, Circular Consensus Sequence (CCS) reads were generated for each dataset using SMRT Analysis 6.0.0 and aligned to the GRCh38 reference genome using minimap2. PCR duplicates from post-capture amplification were identified by their mapping endpoints and tagged using a custom script (https://github.com/williamrowell/markdup). Short variants were joint-called using GATK4 HaplotypeCaller and filtered by quality metrics (for SNPs and indels longer than 1 bp, variants with QD < 2 were removed; for 1 bp indels, variants with QD < 5 were removed). The SNP sites that passed filtration were used together with the deduplicated CCS alignments for read-backed phasing with WhatsHap. We generated consensus sequences by applying the phased SNP information to the reference FASTA sequence using the VCFtools package. For each sample, the analysis yields regions with phased DNA haplotype landscapes (“haplotype separated”), separated by regions with a single DNA landscape where the variants cannot be phased (“collapsed”).
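
The QD thresholds described above can be illustrated with a minimal Python sketch. This is not the authors' filtering code, which the record does not include. The `Variant` record and the functions `min_qd` and `passes_qd_filter` are hypothetical names introduced for illustration, and the sketch assumes biallelic variants whose QD value has already been read from the VCF INFO field.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal record for one biallelic short variant."""
    ref: str   # reference allele
    alt: str   # alternate allele
    qd: float  # QD annotation, assumed already parsed from the VCF


def min_qd(ref: str, alt: str) -> float:
    """Return the minimum QD a variant needs in order to be kept.

    SNPs and indels longer than 1 bp are removed when QD < 2;
    1 bp indels are removed when QD < 5.
    """
    length_change = abs(len(ref) - len(alt))
    if length_change == 1:
        return 5.0  # 1 bp insertion or deletion
    return 2.0      # SNP (no length change) or indel longer than 1 bp


def passes_qd_filter(variant: Variant) -> bool:
    """Keep the variant only if its QD reaches the threshold for its class."""
    return variant.qd >= min_qd(variant.ref, variant.alt)


if __name__ == "__main__":
    examples = [
        Variant("A", "G", 1.9),    # SNP below threshold
        Variant("A", "AT", 4.9),   # 1 bp insertion below threshold
        Variant("ATT", "A", 2.5),  # 2 bp deletion above threshold
    ]
    for v in examples:
        print(v, "kept" if passes_qd_filter(v) else "removed")
```

Multi-allelic sites and records without a QD annotation are outside the scope of this sketch and would need explicit handling in a real pipeline.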

Usage notes