A fast machine-learning-guided primer design pipeline for selective whole genome amplification

Cite this dataset

Dwivedi-Yu, Jane et al. (2022). A fast machine-learning-guided primer design pipeline for selective whole genome amplification [Dataset]. Dryad.


Addressing many of the major outstanding questions in the fields of microbial evolution and pathogenesis will require analyses of populations of microbial genomes. Although population genomic studies provide the analytical resolution to investigate evolutionary and mechanistic processes at fine spatial and temporal scales (precisely the scales at which these processes occur), microbial population genomic research is currently hindered by the practicalities of obtaining sufficient quantities of the relatively pure microbial genomic DNA necessary for next-generation sequencing. Here we present swga2.0, an optimized and parallelized pipeline to design selective whole genome amplification (SWGA) primer sets. Unlike previous methods, swga2.0 incorporates active and machine learning methods to evaluate the amplification efficacy of individual primers and primer sets. Additionally, swga2.0 optimizes primer set search and evaluation strategies, including parallelization at each stage of the pipeline, to dramatically decrease program runtime from weeks to minutes. We describe the swga2.0 pipeline, including the empirical data used to identify the primer and primer set characteristics that improve amplification performance. We also evaluated the swga2.0 pipeline by designing primer sets that successfully amplify Prevotella melaninogenica, an important component of the lung microbiome in cystic fibrosis patients, from samples dominated by human DNA.


Primer sets designed to amplify P. melaninogenica from samples dominated by human DNA were empirically tested to evaluate the efficacy of swga2.0. Six primer sets were created using P. melaninogenica strain ATCC 25845 as the target genome with the human genome (GRCh38.p13) as background. The six primer sets were evaluated in duplicate on purified P. melaninogenica DNA (strain ATCC 25845), diluted to 1% in purified human genomic DNA (Promega, female, catalog No. G1521). Briefly, the 1:99 target:background sample was digested with FspEI (New England Biolabs) according to the manufacturer's protocol. The digested sample was purified using AmpureXP beads (Beckman Coulter) prior to performing selective whole-genome amplification. Reactions were performed in a volume of 50 µL using 50 ng of digested DNA, SWGA primers (total concentration of all primers together = 3.5 mM), 1× Phi29 buffer (New England Biolabs), 1 mM dNTPs, and 30 units Phi29 polymerase (New England Biolabs). Amplification conditions included a ramp-down from 35 °C to 31 °C (5 min at 35 °C, 10 min at 34 °C, 15 min at 33 °C, 20 min at 32 °C, 25 min at 31 °C), followed by a 16 h amplification step at 30 °C. The polymerase was then denatured for 15 min at 65 °C. Amplified samples were purified using AmpureXP beads, prepared for Illumina sequencing, and sequenced on an Illumina MiSeq (150 bp, paired end). The unamplified sample was also sequenced to assess changes in sequencing coverage due to SWGA.
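
The reaction setup above can be checked with a little arithmetic. The following Python sketch is an illustration only and is not part of the swga2.0 pipeline: it transcribes the thermal profile from the text and computes the total incubation time and the approximate target and background mass in each 50 ng reaction. It assumes that the 1% dilution is by mass and that digestion and bead purification leave the 1:99 ratio unchanged; neither assumption is stated in the methods. The names `RAMP_DOWN`, `total_minutes`, and `template_masses` are hypothetical.

```python
# Thermal profile transcribed from the methods text.
# Each step is (temperature in degrees C, duration in minutes).
RAMP_DOWN = [(35, 5), (34, 10), (33, 15), (32, 20), (31, 25)]
AMPLIFICATION = (30, 16 * 60)   # 16 h amplification step
DENATURATION = (65, 15)         # polymerase denaturation


def total_minutes(steps):
    """Sum the durations (in minutes) of a list of (temp, minutes) steps."""
    return sum(minutes for _, minutes in steps)


def template_masses(total_ng, target_fraction):
    """Split the input DNA mass into (target, background) nanograms.

    Assumption: target_fraction is a mass fraction, and the fraction is
    unchanged by digestion and purification.
    """
    target_ng = total_ng * target_fraction
    return target_ng, total_ng - target_ng


profile = RAMP_DOWN + [AMPLIFICATION, DENATURATION]
ramp_minutes = total_minutes(RAMP_DOWN)
run_minutes = total_minutes(profile)
target_ng, background_ng = template_masses(50, 0.01)

print(f"Ramp-down: {ramp_minutes} min")
print(f"Full program: {run_minutes} min ({run_minutes / 60:.1f} h)")
print(f"Target DNA: {target_ng:.1f} ng; background DNA: {background_ng:.1f} ng")
```

Under these assumptions, the ramp-down takes 75 min, the full program runs for 17.5 h, and each reaction starts from roughly 0.5 ng of P. melaninogenica DNA in 49.5 ng of human DNA.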


National Institute of Allergy and Infectious Diseases, Award: R21-AI137433

National Institute of General Medical Sciences, Award: R35-GM134922

National Institute of Allergy and Infectious Diseases, Award: R01-AI142572

Burroughs Wellcome Fund, Award: 1012376