Genome report: Genome sequence of 1S1, a transformable and highly regenerable diploid potato for use as a model for gene editing and genetic engineering
Jayakody, Thilani; Buell, C. Robin (2023), Genome report: Genome sequence of 1S1, a transformable and highly regenerable diploid potato for use as a model for gene editing and genetic engineering, Dryad, Dataset, https://doi.org/10.5061/dryad.5x69p8d70
A genomic resource for a readily transformable diploid potato would enable high-throughput functional analysis in potato. The heterozygous Solanum tuberosum Group Phureja clone 1S1 has a high regeneration rate, self-fertility and desirable tuber traits, and is amenable to Agrobacterium-mediated transformation. To create a contiguous genome assembly, a homozygous doubled monoploid of 1S1 (DM1S1) was sequenced using 44 Gbp of long reads generated with Oxford Nanopore Technologies (ONT), yielding a 736 Mb assembly that encodes 31,145 protein-coding genes. The final assembly for DM1S1 represents a nearly complete genic space, shown by the presence of 99.6% (C:99.5%[S:97.8%, D:1.7%],F:0.1%,M:0.4%,n:1614) of the Benchmarking Universal Single Copy Orthologs. The alternate haplotype of 1S1 was deduced by calling variants from 1S1 Illumina reads with Strelka2 (v2.9.10), GATK’s HaplotypeCaller (v188.8.131.52), and FreeBayes (v1.3.2). These variants were used to create consensus fasta sequences with the DM1S1 assembly using bcftools (v1.9.64).
DNA isolation and library preparation
Genomic DNA for Oxford Nanopore Technologies (ONT) sequencing was isolated, purified and size selected from greenhouse-grown leaves of DM1S1 as described previously (Vaillancourt and Buell 2019). Short sequences were removed using the Circulomics Short Read Eliminator Kit (Circulomics, Baltimore, MD, Cat #SS-100-101-01). Eleven ONT DNA libraries were prepared using the ONT SQK-LSK109 Ligation Sequencing kit and sequenced on six R9 ONT FLO-MIN106 Rev D flow cells. Five of these R9 ONT flow cells were washed and reused according to the Flow Cell Wash kit and protocol (EXP-WSH003, version: WFC_9088_v1_revB_18Sept2019). Sequencing was performed on an ONT (Oxford, UK) MinION (MinION 19.12.5 or 19.10.1) using MinKNOW default settings (MinKNOW v3.5.5, v3.6.0, v3.6.5).
Genomic DNA for whole-genome shotgun sequencing (WGS) was isolated from young leaves of tissue culture-grown DM1S1 and 1S1 clones using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). Illumina TruSeq Nano DNA WGS libraries were prepared and multiplexed using IDT Illumina Unique Dual Index adapters, then sequenced on an Illumina HiSeq 4000 in paired-end mode by the Michigan State University Genomics Core, generating 150 nt reads.
RNA isolation and library construction
RNA from leaf and tuber tissue was isolated using a modified hot borate protocol (Wan and Wilkins 1994) and DNase treated using the Ambion Turbo DNA-free™ Kit (ThermoFisher Scientific, Waltham, MA). RNA of sufficient quality, as assessed by Qubit, NanoDrop and gel electrophoresis, was used to isolate mRNA with the Dynabeads mRNA Purification Kit (ThermoFisher Scientific, Waltham, MA, Cat #61011). The mRNA was used as input to the Oxford Nanopore Technologies (ONT) SQK-PCS-109 kit to generate full-length cDNA libraries. The resulting libraries were sequenced on ONT R9 FLO-MIN106 Rev D flow cells.
ONT gDNA reads were base-called with Guppy v3.5.1 (https://community.nanoporetech.com/downloads) using default parameters and the dna_r9.4.1_450bps_hac.cfg configuration file on an NVIDIA V100 GPU. Reads that passed the base caller's quality filter were then filtered to retain reads longer than 10 kb using an awk script (https://github.com/Thilanij/Public/blob/main/10kb_read_filter.awk), yielding a final set of 1,501,797 reads with a total size of 43.9 Gb and ∼52× coverage.
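The length filter and the coverage figure can be illustrated with a minimal Python sketch. This is a stand-in for illustration only and not the awk script linked above; the function names are hypothetical, the input is assumed to be plain four-line FASTQ, and the 850 Mb genome size used for coverage is an assumption taken from the Flye -g 850m setting.

```python
import io


def filter_fastq_by_length(handle, min_len=10_000):
    """Yield four-line FASTQ records whose sequence is longer than min_len bases.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if len(seq.rstrip("\n")) > min_len:
            yield header, seq, plus, qual


def coverage(total_bases, genome_size):
    """Return the average depth of coverage as total bases over genome size."""
    return total_bases / genome_size


# Tiny in-memory demonstration: one 4-base read and one 10,001-base read.
demo = io.StringIO(
    "@short\nACGT\n+\n!!!!\n"
    "@long\n" + "A" * 10_001 + "\n+\n" + "!" * 10_001 + "\n"
)
kept = [record[0].strip() for record in filter_fastq_by_length(demo)]
print(kept)

# 43.9 Gb of reads over an assumed 850 Mb genome gives roughly 52x.
depth = round(coverage(43.9e9, 850e6), 1)
print(depth)
```

Under the 850 Mb assumption the reported 43.9 Gb corresponds to about 51.6×, which agrees with the stated ∼52× coverage.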
Contigs were assembled from the final set of ONT reads using Flye v2.4.2 (Kolmogorov et al. 2019) with the parameters --nano-raw -g 850m. The initial assembly was polished with the final set of ONT reads using two iterations of Racon v1.3.2 (Vaser et al. 2017). For each iteration, the reads were mapped to the assembly using minimap2 v2.13 (Li et al. 2018) with the parameter -x map-ont, and the assembly was then polished with the read alignments using Racon with the -u parameter set. The assembly was further polished with the final set of ONT reads using one round of Medaka v0.12.1 (https://community.nanoporetech.com/downloads). Final polishing was performed with Illumina WGS reads (DM1S1_AA) using three rounds of Pilon v1.23 (Walker et al. 2014). The Illumina reads were first processed with Cutadapt v2.1 (Martin 2011) to remove adapters and trim low-quality regions with the parameters -n 3 -m 100 -q 30,30. For each iteration, the cleaned reads were aligned to the assembly using Bowtie2 v184.108.40.206 (Langmead and Salzberg 2012), and the alignments were sorted with SAMtools v1.10 (Li et al. 2009). Pilon was run with the parameters --fix all --changes --frags. Contigs were scaffolded using a reference-guided approach with RagTag v1.0.2 (Alonge et al. 2021) using default parameters and the DM v6.1 assembly as the reference (Pham et al. 2020). Benchmarking Universal Single Copy Orthologs (BUSCO) v5.2.2 (Simão et al. 2015) was used to assess the quality of the final assembly with the orthologs from the embryophyta_odb10 dataset (n=1614). To assess the completeness of the assembly relative to the potato reference, high-confidence gene models from the DM v6.1 reference were aligned to DM1S1 using minimap2 v2.13 in splice-aware mode.
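The BUSCO summary reported for the final assembly can be checked for internal consistency with a short Python sketch. The parser below is a hypothetical helper written for this illustration and is not part of BUSCO. Reading the 99.6% "present" figure as complete plus fragmented orthologs is an interpretation of the reported numbers, not something the methods state.

```python
import re


def parse_busco(summary):
    """Parse a one-line BUSCO summary string into a dict of percentages plus n."""
    pairs = re.findall(r"([CSDFM]):([\d.]+)%", summary)
    values = {key: float(value) for key, value in pairs}
    values["n"] = int(re.search(r"n:(\d+)", summary).group(1))
    return values


busco = parse_busco("C:99.5%[S:97.8%, D:1.7%],F:0.1%,M:0.4%,n:1614")

# Complete = single-copy + duplicated.
complete = round(busco["S"] + busco["D"], 1)

# Complete + fragmented + missing should account for every ortholog.
total = round(busco["C"] + busco["F"] + busco["M"], 1)

# Assumed reading of "present": complete + fragmented.
present = round(busco["C"] + busco["F"], 1)

print(complete, total, present)
```

The three values come out to 99.5, 100.0 and 99.6, so the categories sum as expected and the 99.6% figure is consistent with counting both complete and fragmented orthologs as present.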
The final genome assembly was repeat masked with RepeatMasker v4.1.0 (Flynn et al. 2020) using the DM v6.1 custom repeat library (Pham et al. 2020) and the parameters: -s -nolow -no_is -gff. Ab initio gene predictions were made using Augustus v3.3.3 (Stanke et al. 2006) with the DM v6.1 training matrix and the soft-masked assembly. The Nanopore cDNA reads for each library were processed with Pychopper v2.4.0 (https://github.com/nanoporetech/pychopper) and aligned to the assembly using Minimap2 v2.17-r941 (Li et al. 2018) with the parameters: -ax splice -uf -G 5000. The cDNA alignments were assembled into transcript assemblies using StringTie v2.1.4 (Kovaka et al. 2019) with the parameters: -L -m 500. The ab initio gene predictions were refined with two rounds of PASA2 v2.4.1 (Haas et al. 2003) using the transcript assemblies for each cDNA library, yielding the set of working gene models. High-confidence gene models were identified and functional annotation was assigned as described in Pham et al. 2020.
Whole-genome sequence reads were cleaned using Cutadapt (v2.1) with a minimum base quality of 35 and a minimum read length of 50 bp after trimming. Cleaned fastq reads were converted to an unmapped BAM using Picard’s FastqToSam, and adapter sequences were marked using Picard’s MarkIlluminaAdapters and SamToFastq, with CLIPPING_ATTRIBUTE=XT and CLIPPING_ACTION=2 (v2.18.27) (Picard Tools). Genomic reads were mapped to the DM1S1 assembly in paired-end mode, flagging secondary hits (-M), using BWA-MEM (v0.7.17) (Li, 2013), and then filtered to keep only properly paired reads using SAMtools (v1.7). MergeBamAlignment was used to restore and adjust metadata and to allow any number of insertions or deletions by setting MAX_INSERTIONS_OR_DELETIONS=-1. Duplicate reads were marked using Picard’s MarkDuplicates. Reads surrounding insertions/deletions were identified and realigned using GATK’s (v3.8.1) (McKenna et al. 2010) RealignerTargetCreator and IndelRealigner, respectively. Strelka2 (v2.9.10) (Kim et al., 2018), GATK’s HaplotypeCaller (v220.127.116.11) (McKenna et al., 2010), and FreeBayes (v1.3.2) (Garrison and Marth, 2012) were used to call germline variants. Strelka2 was run with default parameters, and variants flagged by any of the following Strelka2 filters were removed: IndelConflict, SiteConflict, LowGQX, HighDPFRatio, HighSNVB, and HighDepth. HaplotypeCaller was run using the --min-base-quality-score 20 parameter. FreeBayes was run using the following parameters: -C 4, --min-mapping-quality 30, --min-base-quality 30. All variants were hard filtered to remove any multiallelic sites, and only calls for the alternate allele were kept. These variants were used to create consensus fasta sequences with the DM1S1 assembly using bcftools (v1.9.64).
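The final hard filter can be sketched in Python as follows. This is an illustrative sketch rather than the commands used for the dataset. The function names are hypothetical, the input is assumed to be uncompressed single-sample VCF text, and "calls for the alternate allele" is interpreted here as genotypes that carry at least one copy of the alternate allele, which is an assumption.

```python
def keep_record(vcf_line):
    """Return True for a biallelic record whose first sample carries the ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    alt = fields[4]
    # Multiallelic sites list several ALT alleles separated by commas.
    if alt == "." or "," in alt:
        return False
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    if "GT" not in format_keys:
        return False
    genotype = sample_values[format_keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")
    # Assumption: keep the call if any allele in the genotype is the ALT allele.
    return "1" in alleles


def filter_vcf(lines):
    """Yield header lines unchanged and only the records that pass keep_record."""
    for line in lines:
        if line.startswith("#") or keep_record(line):
            yield line


demo_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n",
    "chr01\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n",        # kept: biallelic, carries ALT
    "chr01\t200\t.\tA\tG,T\t50\tPASS\t.\tGT\t1/2\n",      # dropped: multiallelic
    "chr01\t300\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\n",        # dropped: no ALT allele called
    "chr01\t400\t.\tG\tA\t50\tPASS\t.\tGT:DP\t1/1:12\n",  # kept: homozygous ALT
]
kept_positions = [
    line.split("\t")[1] for line in filter_vcf(demo_vcf) if not line.startswith("#")
]
print(kept_positions)
```

Keeping only biallelic records means each retained site has a single unambiguous substitution to apply when the consensus sequence is built from the DM1S1 assembly.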
National Science Foundation, Award: DGE-1828149
U.S. Department of Agriculture, Award: 2018-33522-28736