Optimization of microhaplotypes for advanced DNA mixture deconvolution

Cavanaugh, Sarah1; Feldman, Andrew2; Lin, Jeffrey2; Borchert, Lauren3; Fenske, Regan3; Kim, Aki3; Sriram, Trupti1; Bever, Robert1; Kidd, Kenneth4; Podini, Daniele3; Davoren, Jonathan1

Published Jan 22, 2026 on Dryad. https://doi.org/10.5061/dryad.k98sf7mmt

Data files

Jan 22, 2026 version files, 544.60 KB total

15pnij-22-gg-04393-dnax_AeTable.csv (19.68 KB)
15pnij-22-gg-04393-dnax_LocusInfo.csv (4.46 KB)
15pnij-22-gg-04393-dnax_Mixtures_PerDonorMetrics.csv (97.98 KB)
15pnij-22-gg-04393-dnax_MockEvidence_PerDonorMetrics.csv (7.88 KB)
15pnij-22-gg-04393-dnax_PopulationFrequencies.csv (394.79 KB)
15pnij-22-gg-04393-dnax_SensitivityPerSample.csv (6.90 KB)
README.md (12.92 KB)

Abstract

Detection of minor DNA components in biological mixtures has increased as molecular techniques have become more sensitive. Accordingly, mixture deconvolution has become a major concern and topic of debate in the forensic DNA community. Short tandem repeat (STR) profile data generated with capillary electrophoresis and massively parallel sequencing (MPS) are subject to inherent issues that complicate mixture deconvolution. Deconvolution may be improved by sequencing microhaplotypes as they are not subject to the amplification noise artifacts and stochastic effects that impact STRs. Before microhaplotypes can be implemented in casework, the following considerations should be addressed: definition of a consistent panel of microhaplotype loci; increased population studies to determine relevant haplotype allele frequencies; incorporation of advanced sequencing technologies into forensic laboratories; development of user-friendly bioinformatic analysis and mixture deconvolution methods; and assessment of the infrastructure requirements necessary to build a searchable microhaplotype criminal database. In two phases, this study will optimize and assess an MPS workflow and analysis package for improved mixture deconvolution using microhaplotypes. Analysis will be performed with NexGenID, a novel software platform optimized for mixture deconvolution and probabilistic genotyping of sequence data. Phase I objectives will include evaluation and down-selection of microhaplotype loci optimal for individualization and mixture deconvolution; construction of wet-bench target assay; haplotyping of donor samples to obtain expanded population allele frequency data; and assessment of the projected performance of the microhaplotype allele calling analysis workflow. 
Phase II objectives will include evaluating the benefits and limitations of mixture deconvolution and probabilistic genotyping using the microhaplotype wet-bench assay with Illumina sequencing and NexGenID analysis. The workflow will be applied to in vitro mixtures and constructed mock evidence, and NexGenID outcomes will be compared to analysis of microhaplotype mixtures using a retrofitted version of EuroForMix. By coupling a highly discriminatory microhaplotype MPS assay with NexGenID, microhaplotype analysis can be efficiently implemented by practitioners. The proposed microhaplotype workflow has the potential to exceed the minor-contributor detection achieved with STR deconvolution, help solve complex cases, increase the number of samples considered suitable for comparison, and enable retesting of cold cases where a minor contributor was assumed present but was not suitable for comparison.


This data set contains locus information, assay sensitivity metrics, and mixture deconvolution results for the evaluation of a target sequencing microhaplotype assay.
This project was supported by Award No. 15PNIJ-22-GG-04393-DNAX, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of the Department of Justice.

Description of the data and file structure

Tabular file 15pnij-22-gg-04393-dnax_LocusInfo.csv contains the chromosome locations for each of the 43 loci evaluated in this effort.

MH Name: Locus name following general convention established in Kidd 2016 (https://doi.org/10.1186/s40246-016-0078-y)
Chr#: Chromosome on which the locus is located
Target Region Chr_Start (hg38): Chromosome position of the first allele-defining SNP of the locus, relative to the human reference build GRCh38
Target Region Chr_Stop (hg38): Chromosome position of the last allele-defining SNP of the locus, relative to the human reference build GRCh38
#Hap alleles in 35 pops: Total number of observed unique haplotypes in 35 populations evaluated, described below
#SNPs (MAF >1%): Total number of SNPs contained in the locus target region with a minor allele frequency (MAF) > 1% as designated in NCBI dbSNP 153
Ae Minimum: Minimum Ae value observed across all 35 populations
Ae Median: Median Ae value observed across all 35 populations
Ae Average: Average Ae value observed across all 35 populations
Ae Maximum: Maximum Ae value observed across all 35 populations
Rosenberg In: Rosenberg's measure of informativeness for the locus

Tabular file 15pnij-22-gg-04393-dnax_PopulationFrequencies.csv contains microhaplotype allele frequencies calculated for 1000 Genomes Phase 3 population donors (n=2504) and an additional 240 internally genotyped samples. Population descriptions and the total number of donors (N) per population are provided in the Ae Table file described below. The locus-defining SNP-based alleles are listed for each locus in column A, while the population-specific frequencies are contained in columns B - AJ.

Tabular file 15pnij-22-gg-04393-dnax_AeTable.csv details the calculated Ae value for each of the 35 populations at each locus:

World Region: Describes the general geographical regions where the population is historically located
N: total number of individuals evaluated from the population
Group Names: Specific descriptors of population groups comprising each population code
Population Code: Acronyms or names used to define a population group
Columns D - AT contain Ae values for the microhaplotype locus named in row 3

Tabular file 15pnij-22-gg-04393-dnax_SensitivityPerSample.csv details sample metrics for the microhaplotype assay sensitivity analysis as follows:

Sample name: Given sample identification name, combination of donor code, DNA input, and replicate for assay testing
Donor Code: Anonymized internal donor code
DNA Input (ng): The total amount of DNA added to the AmpliSeq for Illumina microhaplotype target sequencing assay
Replicate: All samples were tested in triplicate. Replicates are lettered A, B, and C
Library Quant (ng/µl): The determined dsDNA concentration of the final, clean AmpliSeq microhaplotype library generated by each sample
AVG Library size (bp): Average fragment size observed for each sample library, determined by Agilent D1000 HS ScreenTapes
Total Reads: The total number of identified reads for the given sample barcode after Illumina MiSeq sequencing and demultiplexing, calculated using samtools flagstat
Total Mapped Reads: The total number of demultiplexed reads for a given sample barcode mapped to the GRCh38 human genome reference, calculated using samtools flagstat
Total Alleles: total number of unique alleles observed for the given donor across 43 loci. Homozygous genotypes counted as 1 allele, heterozygous genotypes counted as 2 alleles
Percent Profile: Percentage of total number of alleles obtained vs total number of alleles expected for the given donor
Max Locus Read Depth: Maximum number of reads obtained at a locus
Min Locus Read Depth: Minimum number of reads obtained at a locus
AVG Read depth across 43 loci: Average number of reads obtained per locus across all 43 loci
STD DEV: standard deviation of read depths across all 43 loci
Inter-locus balance: ratio of max locus read depth vs min locus read depth
Intra-locus allele balance: The average allele balance between alleles within a heterozygous locus
Heterozygosity: Ratio of heterozygous loci vs all sequenced loci in the sample, given as a percent
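
The per-sample metrics above follow directly from their definitions. The sketch below is illustrative only: the function name, argument names, and locus labels are hypothetical, and it assumes "all sequenced loci" means every locus with a recorded read depth. Intra-locus allele balance and STD DEV are left out because their exact formulas are not stated here.

```python
from statistics import mean


def sensitivity_metrics(locus_depths, alleles_observed, alleles_expected, n_het_loci):
    """Recompute several per-sample sensitivity metrics from their definitions.

    locus_depths: dict mapping locus name -> read depth (hypothetical input layout)
    alleles_observed / alleles_expected: allele counts for the donor
    n_het_loci: number of heterozygous loci in the sample
    """
    depths = list(locus_depths.values())
    max_depth, min_depth = max(depths), min(depths)
    return {
        # Alleles obtained vs alleles expected, as a percentage
        "percent_profile": 100.0 * alleles_observed / alleles_expected,
        "max_depth": max_depth,
        "min_depth": min_depth,
        "avg_depth": mean(depths),
        # Ratio of max locus read depth vs min locus read depth
        "inter_locus_balance": max_depth / min_depth if min_depth > 0 else float("inf"),
        # Heterozygous loci vs all loci with a depth value, as a percentage
        "heterozygosity_pct": 100.0 * n_het_loci / len(depths),
    }


# Toy example with three made-up loci
example = sensitivity_metrics({"mhA": 1000, "mhB": 500, "mhC": 250}, 5, 5, 2)
```

An inter-locus balance of 1 would indicate perfectly even coverage; larger values indicate greater imbalance between the best and worst performing loci.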

Tabular file 15pnij-22-gg-04393-dnax_Mixtures_PerDonorMetrics.csv details sample metrics for microhaplotype genotyping and deconvolution of constructed mixtures as follows:

Mixture Category: Designates one of four mixture construction categories
Mixture Code: Sequentially numbered mixture ratio designation
Sample name: Given sample identification name, combination of mixture code, DNA input, and replicate for assay testing
Total DNA Input (ng): The total amount of DNA added to the AmpliSeq for Illumina microhaplotype target sequencing assay
Expected number of contributors: Number of known donors combined to create mixture
Observed number of contributors: Following deconvolution analysis, the possible number of contributors present in the mixture as indicated by the sequence data
Donor #: Contributors given a donor number based on expected proportion to mixture. Donor 1 - largest contribution, Donor 5 - lowest contribution
Donor Code: Anonymized internal donor code
Expected Proportion: Target contribution percentage during mixture construction
Observed Proportion: Estimated contributor proportion following deconvolution analysis
Donor alleles: total number of alleles observed in the mixture data for a given donor; max possible alleles = 86
Allele Dropout: number of alleles below analytical threshold or unobserved for a known contributor in sequence data
Detection Statistical Significance: Probability a given Person of Interest randomly matches an observed fraction of alleles in sequence results for a complex mixture. Threshold for detection <10E-6
Max Log(LR) NexGenID: Log10 of likelihood ratio as calculated in NexGenID software for hypothesis testing of LR = Pr(E|Hp)/Pr(E|Hd) where Hp = person with the reference genotype is a contributor to the sample DNA vs Hd = random person(s) from the population, unrelated to Person of Interest, are the sources of the DNA
High Confidence GT Correct: Total number of per locus allele pairings exceeding a genotype confidence threshold of 0.995 and correct relative to the known contributor
High Confidence GT Wrong: Total number of per locus allele pairings exceeding a genotype confidence threshold of 0.995 and incorrect relative to the known contributor
Correct below confidence threshold: Total number of per locus allele pairings that do not exceed a genotype confidence threshold of 0.995 but are correct relative to the known contributor
Max Log(LR) EuorForMix: Log10 of likelihood ratio as calculated in EuroForMix software for hypothesis testing of LR = Pr(E|Hp)/Pr(E|Hd) where Hp = person with the reference genotype is a contributor to the sample DNA vs Hd = random person(s) from the population, unrelated to Person of Interest, are the sources of the DNA
Observed Proportion STRs w/STRmix: Estimated contributor proportion following deconvolution analysis with STR genotypes in STRmix v2.9
Max Log(LR) STRmix: Log10 of likelihood ratio as calculated in STRmix software for hypothesis testing of LR = Pr(E|Hp)/Pr(E|Hd) where Hp = person with the reference genotype is a contributor to the sample DNA vs Hd = random person(s) from the population, unrelated to Person of Interest, are the sources of the DNA
n/a indicates no data available for that replicate for the given deconvolution software.

Tabular file 15pnij-22-gg-04393-dnax_MockEvidence_PerDonorMetrics.csv details sample metrics for microhaplotype genotyping and deconvolution of constructed mock evidence samples as follows:

Mixture substrate: substrate handled by donors to generate mock touch/trace evidence sample
Total DNA Input (ng): The total amount of DNA added to the AmpliSeq for Illumina microhaplotype target sequencing assay
Expected number of contributors: Number of known donors combined to create a mixture
Observed number of contributors: Following the deconvolution analysis, the possible number of contributors present in the mixture, as indicated by the sequence data
Donor #: Contributors given a donor number based on the expected proportion to the mixture. Donor 1 - largest contribution, Donor 5 - lowest contribution
Expected Proportion: Target contribution percentage during mixture construction
Observed Proportion: Estimated contributor proportion following deconvolution analysis
Matched Donor Code: Anonymized internal code of the donor matched to the observed proportion genotype or compared as the PoI for LR calculation
Allele Detection above AT: Number of PoI alleles detected in the mixture data above the determined analytical threshold
Alleles Observed below AT: Number of PoI alleles present in the mixture data but represented by a read depth below analytical threshold
Allele Dropout: Number of PoI alleles with no representation in the mixture data, true allele loss
Detection Statistical Significance: Probability that a given Person of Interest randomly matches an observed fraction of alleles in sequence results for a complex mixture. Threshold for detection <10E-6
High Confidence GT Correct: Total number of per locus allele pairings exceeding a genotype confidence threshold of 0.995 and correct relative to the known contributor
High Confidence GT Wrong: Total number of per locus allele pairings exceeding a genotype confidence threshold of 0.995 and incorrect relative to the known contributor
Correct below confidence threshold: Total number of per locus allele pairings that do not exceed a genotype confidence threshold of 0.995 but are correct relative to the known contributor
Max Log(LR) NexGenID: Log10 of likelihood ratio as calculated in NexGenID software for hypothesis testing of LR = Pr(E|Hp)/Pr(E|Hd) where Hp = person with the reference genotype is a contributor to the sample DNA vs Hd = random person(s) from the population, unrelated to Person of Interest, are the sources of the DNA
Max Log(LR) EuorForMix: Log10 of likelihood ratio as calculated in EuroForMix software for hypothesis testing of LR = Pr(E|Hp)/Pr(E|Hd) where Hp = person with the reference genotype is a contributor to the sample DNA vs Hd = random person(s) from the population, unrelated to Person of Interest, are the sources of the DNA
n/a indicates no result available for that Donor for the given metric.
-- indicates missing data for the 5-person handgun frame sample

Sharing/Access information

Links to other publicly accessible locations of the data: https://nij.ojp.gov/funding/awards/15pnij-22-gg-04393-dnax
A link to the Dryad dataset will also be accessible through the National Archive of Criminal Justice Data (NACJD): https://www.icpsr.umich.edu/web/pages/NACJD/.

Was the data derived from another source? Microhaplotype alleles for population frequency calculations were in part derived from the 1000 Genomes Project Phase 3 dataset (https://www.internationalgenome.org).

Human subjects data

No personally identifiable information is provided in these data. Microhaplotype allele frequencies calculated within a population classification are disclosed; however, no genotype information of individuals is provided. All samples were collected under IRB approval or purchased from a biorepository as anonymized donors. IRB consent allows for publishing of de-identified data in the public domain, so long as individual genotypes are withheld. Upon collection, all donors were assigned a sample archiving code to de-identify the source donor and then given a secondary project code for further de-identification in this project. Two commercially available samples often used as control samples for genotyping studies were also purchased for use in this project: Promega 2800M and NIST RM8391/NA24385. Lastly, the NIST RGTM samples were obtained from NIST. These are also publicly accessible, de-identified samples whose donors consented to disclosure of genotype information; however, no genotype information is provided for these samples at this time.

An AmpliSeq for Illumina assay was developed to target amplify and sequence 43 forensically relevant microhaplotype loci, specifically selected for their potential application to complex mixture deconvolution. Target loci were selected from previously published work compiled in the MicroHapDB database [https://microhapdb.readthedocs.io/en/latest/].

A set of 240 test samples was identified from a donor collection of nearly 500 individuals previously collected under IRB and housed at GWU. These donors represent eleven biogeographical populations. Additional donors were identified from purchased blood bank samples previously obtained for internal validation projects.
For sensitivity testing of the final assay, 2800M Control DNA (Promega Corp, Madison, WI), NA24385 (NIST RM8391), and two purchased blood samples were serial diluted to test inputs of 2 ng, 1 ng, 0.5 ng, 0.1 ng, 0.05 ng, and 0.025 ng. Each dilution was evaluated in triplicate libraries.

A set of 149 complex mixtures was constructed in vitro from aliquots of the population donor samples described above. Mixtures contained between 2 and 5 contributors at disparate contribution proportions, and total DNA amounts of 0.5 ng, 1 ng, or 5 ng. Mixtures were constructed to meet the goals of the following four categories: Category 1 - minor contributor detection limits down to 1%; Category 2 - estimating the correct number of contributors when first-degree relative pairs are present in the mixture; Category 3 - improvement to genotype separation over STRs when donors share STR alleles in stutter positions; and Category 4 - presence of donors with imbalanced degradation patterns induced by UV exposure. All constructed mixtures were first processed with STR-CE analysis as follows: amplification of a 1 ng input with GlobalFiler full volume reactions, capillary electrophoresis fragment separation on an Applied Biosystems 3500xL Genetic Analyzer, and data analysis with GeneMapper ID-X following internally validated SOPs. STRmix v2.9 was used to deconvolve and interpret STR-CE mixtures following internally validated SOPs.

To construct the mock touch evidence mixtures, a total of nine participants were identified. Each provided informed consent, and a buccal sample was collected from each to obtain their reference genotype. Trace DNA samples containing 3–5 contributors were made in triplicate by having donors handle items relevant to gun crimes, including handgun frames, handgun magazines, rifle bolts, and brass 9 mm round cartridge cases. All substrates were decontaminated prior to handling via UV decontamination and/or bleach cleaning. Donors were instructed to wait two hours after washing their hands with warm water before handling the items with their dominant hand for 20–60 seconds. After handling, bullet samples were loaded individually into a gun chamber and fired. Fired cartridge casings were collected and individually packaged prior to extraction. Firearm substrate (non-casing) trace DNA samples were collected with wet/dry nylon flocked swabbing. DNA extraction was performed using the Qiagen EZ1&2 DNA Investigator kit in 500 µl Large Volume reactions following an internally validated SOP. The rinse-and-swab collection method was performed on fired cartridge casings according to Bille et al. (2020, https://doi.org/10.1016/j.fsigen.2020.102238). Cartridge casing samples collected via the rinse-and-swab method were extracted using the modified QIAamp DNA Investigator Kit method described by Bille et al. (2020). DNA extracts were concentrated with Microcon DNA Fast Flow filter units (MilliporeSigma, Burlington, MA) prior to quantification with the Quantifiler Trio DNA Quantification Kit in 11 µl reaction volumes to assess DNA concentration, DNA degradation, and inhibition related to the various substrates. Donor references were amplified with 1 ng DNA. All recovered DNA from mock evidence mixture samples was targeted for amplification and library preparation with the AmpliSeq microhap assay as described below.
In addition to touch evidence samples, a set of inhibited mixture samples was constructed. A 1 ng aliquot of the NIST RGTM S8 3-person mixture was combined with humic acid at concentrations of 50 ng, 150 ng, and 250 ng to examine amplification with the AmpliSeq reaction buffer in the presence of an inhibitor.

All reference samples, sensitivity samples, constructed mixtures, and mock evidence samples were amplified using the custom microhaplotype AmpliSeq primer mix and the AmpliSeq Library PLUS for Illumina prep kit following the manufacturer’s recommendations for AmpliSeq for Illumina Custom Panels with one primer pool. First, DNA samples were target amplified in 20 µl reactions with the following amplification parameters: 99 ˚C for 2 minutes, 23 cycles of 99 ˚C for 15 seconds and 60 ˚C for 4 minutes, and a final hold at 10 ˚C for up to 24 hours. Amplicons were then partially digested with 2 µl of FuPa Reagent on a thermal cycler as follows: 10 minutes at 50 ˚C, 10 minutes at 55 ˚C, and 20 minutes at 62 ˚C. Next, AmpliSeq CD Index i7 and i5 adapters were ligated to the partially digested amplicons as follows: 30 minutes at 22 ˚C, 5 minutes at 68 ˚C, and 5 minutes at 72 ˚C. Libraries then underwent a second amplification (98 ˚C for 2 minutes, 7 cycles of 98 ˚C for 15 seconds and 64 ˚C for minute, and a final hold at 10 ˚C for up to 24 hours) and were purified with AMPure XP. Finally, libraries were quantified with the Qubit dsDNA HS assay on the Qubit 4 fluorometer and sized using the Agilent TapeStation 4120 and D1000 ScreenTapes (Agilent Technologies, Santa Clara, CA). For Illumina sequencing, libraries were diluted to 4 nM and pooled in equimolar proportions. Libraries were pooled in groups of no more than 40 to ensure adequate depth of coverage for every donor allele in the sample library. All pools were diluted to a loading concentration of 9 pM with a 2% PhiX sequencing control, per manufacturer’s recommendations. Cluster generation and 2x300 paired-end sequencing were performed on the MiSeq FGx system using MiSeq v3 (600-cycle) reagents.
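
Diluting each library to 4 nM requires converting the Qubit concentration (ng/µl) and the TapeStation average fragment size (bp) into molarity. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA; that constant, the function names, and the 10 µl final volume are assumptions for illustration and are not taken from the study's SOPs.

```python
def library_nM(conc_ng_per_ul, avg_size_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA library concentration to nanomolar.

    Assumes an average mass of 660 g/mol per base pair (a standard approximation).
    """
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * avg_size_bp)


def dilution_volumes(stock_nM, target_nM=4.0, final_volume_ul=10.0):
    """Return (library_ul, diluent_ul) needed to reach target_nM in final_volume_ul."""
    if stock_nM < target_nM:
        raise ValueError("stock is already below the target concentration")
    library_ul = target_nM * final_volume_ul / stock_nM
    return library_ul, final_volume_ul - library_ul


# Hypothetical library: 2.64 ng/ul with a 400 bp average fragment size
stock = library_nM(2.64, 400)
volumes = dilution_volumes(stock)
```

Because every library is normalized to the same molarity before pooling, combining equal volumes gives the equimolar pool described above.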

Genotyping of all donor references and mixture samples was performed in one of two ways:
1) Genotype analysis of the population samples was performed as follows: sequence data were mapped to the hg38 canonical reference with bwa mem, executed in Galaxy (usegalaxy.org). The resultant .bam files were further processed for microhap genotype calling using mh.jar, a Java-based application previously developed in collaboration with ThermoFisher and adapted for the current microhaplotype assay.
2) NexGenID (NexGen Forensic Sciences, Columbia, MD) was used to perform haplotype determination from raw .fastq files from all mixture and mock evidence samples as follows: cluster amplicon sequences based on locus, perform a local alignment, and identify unique alleles based on identical sequence. Analytical and stochastic thresholds are applied for identification of unique alleles above noise reads. Both thresholds are sample-specific, driven by the input DNA quantity, which dictates how many templates were initially added for amplification.
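
To give a sense of why input quantity matters for thresholds, the number of template copies can be estimated from the DNA mass. The sketch below assumes roughly 3.3 pg of DNA per haploid human genome, a commonly used approximation; it is not NexGenID's threshold model, and the function names are hypothetical.

```python
HAPLOID_GENOME_PG = 3.3  # assumed mass of one haploid human genome, in picograms


def haploid_copies(input_ng):
    """Approximate number of haploid genome copies in a DNA input given in nanograms."""
    return input_ng * 1000.0 / HAPLOID_GENOME_PG


def contributor_copies(total_input_ng, proportion):
    """Approximate haploid copies contributed by a donor at a given mixture proportion."""
    return haploid_copies(total_input_ng) * proportion


# Lowest sensitivity input tested in this dataset: 0.025 ng
low_input_copies = haploid_copies(0.025)
# A 1% minor contributor in a 1 ng mixture
minor_copies = contributor_copies(1.0, 0.01)
```

Under this approximation, a 0.025 ng input and a 1% contributor in a 1 ng mixture both correspond to only a handful of template copies per locus, which is the regime where stochastic effects and allele dropout are expected.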

Haplotype frequencies were calculated based on phased SNP genotypes obtained from 1000 Genomes Phase 3 sequence data in the UCSC Genome Browser ([http://genome.ucsc.edu/]) as well as the additional 240 single-source donor samples, for a total of 35 populations evaluated. Effective number of alleles (Ae) values were calculated using the formula Ae = 1/Σ(p_i^2), where p_i is the frequency of allele i at the locus in a given population. Informativeness (In), which measures allele frequency differences among populations, was calculated according to Rosenberg et al. (2003; DOI:10.1086/380416).
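
The Ae formula can be applied to any one population column of the frequency table. A minimal sketch follows, assuming the frequencies for a locus sum to 1 within a population; the function name and the tolerance are illustrative choices.

```python
def effective_alleles(freqs, tol=1e-6):
    """Effective number of alleles: Ae = 1 / sum(p_i^2) over allele frequencies p_i."""
    total = sum(freqs)
    if abs(total - 1.0) > tol:
        raise ValueError(f"allele frequencies sum to {total}, expected 1")
    return 1.0 / sum(p * p for p in freqs)


# Four equally frequent haplotypes behave like four effective alleles
ae_even = effective_alleles([0.25, 0.25, 0.25, 0.25])
# One dominant haplotype lowers Ae even though four alleles are observed
ae_skewed = effective_alleles([0.7, 0.1, 0.1, 0.1])
```

Ae equals the observed allele count only when all alleles are equally frequent, which is why loci with high Ae are attractive for distinguishing contributors in mixtures.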

Mixture deconvolution was performed using the probabilistic genotyping methods of NexGenID and EuroForMix v4.2.5 ([https://www.euroformix.com/]). Note that genotyping output from NexGenID was converted to a EuroForMix-compatible format. Finally, likelihood ratios were calculated by both software packages under the specified hypotheses: Hp: person of interest included; Hd: all contributors unknown. Each known contributor to a given mixture was evaluated as the PoI. The output included quantitative likelihood ratios (logLR) for weight-of-evidence reporting.
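
The reported logLR values are base-10 logarithms of LR = Pr(E|Hp)/Pr(E|Hd). The sketch below shows only this arithmetic; the probabilities themselves come from the probabilistic genotyping software, and the function name and example values are hypothetical.

```python
import math


def log10_lr(pr_e_given_hp, pr_e_given_hd):
    """Return log10 of the likelihood ratio Pr(E|Hp) / Pr(E|Hd).

    Computed as a difference of logs, which is equivalent to log10 of the ratio.
    """
    if pr_e_given_hp <= 0 or pr_e_given_hd <= 0:
        raise ValueError("both probabilities must be positive")
    return math.log10(pr_e_given_hp) - math.log10(pr_e_given_hd)


# Made-up probabilities: evidence is a million times more probable under Hp
example_log_lr = log10_lr(1e-3, 1e-9)
```

A positive logLR supports Hp (the PoI is a contributor), a negative value supports Hd, and each unit corresponds to a tenfold change in the likelihood ratio.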

Additional comparative statistical analyses were performed in JMP® v18.0.1 statistical discovery software.

Usage notes:

Allele frequencies, data quality metrics, and mixture deconvolution LogLRs are compiled in .csv tables. Sensitivity, in vitro mixture, and mock evidence sample results are provided in separate files. Neither SNP genotypes nor sequence data are provided, to maintain donor privacy.
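
When loading the per-donor metric tables, the "n/a" and "--" placeholders need to be treated as missing values before numeric analysis. The sketch below uses only the Python standard library; it assumes a single header row with the column names listed in this description, which may not match the exact layout of every file (for example, the Ae table places locus names in row 3). The donor codes in the example are made up.

```python
import csv
import io

MISSING_MARKERS = {"n/a", "--", ""}  # placeholders described in this README


def to_float(cell):
    """Convert a table cell to float, returning None for missing-value markers."""
    text = cell.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return float(text)


def read_metric(handle, column):
    """Read one numeric column from an open CSV handle that has a header row."""
    reader = csv.DictReader(handle)
    return [to_float(row[column]) for row in reader]


# In-memory stand-in for a per-donor metrics file (values are invented)
sample = io.StringIO(
    "Donor Code,Max Log(LR) NexGenID\n"
    "D01,12.5\n"
    "D02,n/a\n"
    "D03,--\n"
)
values = read_metric(sample, "Max Log(LR) NexGenID")
```

For a real file, the same function could be called with a handle from `open(path, newline="")` in place of the in-memory example.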