Comparative evaluation of genotyping technologies for investigative genetic genealogy in sexual assault casework
Data files
Nov 12, 2024 version files 542.44 KB
- 15PNIJ-21-GG-04143-MUMU_GenomeSequencing_GEDmatchPRO_MatchMetrics.xlsx (23.76 KB)
- 15PNIJ-21-GG-04143-MUMU_GenomeSequencing_SampleMetrics_Degradation.xlsx (40.05 KB)
- 15PNIJ-21-GG-04143-MUMU_GenomeSequencing_SampleMetrics_Sensitivity.xlsx (21.18 KB)
- 15PNIJ-21-GG-04143-MUMU_GSAv2_GEDmatchPRO_MatchMetrics.xlsx (24.52 KB)
- 15PNIJ-21-GG-04143-MUMU_GSAv2_SampleMetrics_Degradation.xlsx (30.92 KB)
- 15PNIJ-21-GG-04143-MUMU_GSAv2_SampleMetrics_Sensitivity.xlsx (16.46 KB)
- 15PNIJ-21-GG-04143-MUMU_Kintelligence_GEDmatchPRO_MatchMetrics.xlsx (30.31 KB)
- 15PNIJ-21-GG-04143-MUMU_Kintelligence_SampleMetrics_Degradation.xlsx (48.31 KB)
- 15PNIJ-21-GG-04143-MUMU_Kintelligence_SampleMetrics_Sensitivity.xlsx (23.90 KB)
- 15PNIJ-21-GG-04143-MUMU_MD001_GEDmathPRO_MatchLists.docx (219.77 KB)
- README.md (63.27 KB)
Abstract
Investigative Genetic Genealogy (IGG) offers a capability to identify investigative leads when CODIS searching is unproductive, and IGG can provide time-efficient methods for removing perpetrators of serial violent crimes, such as rape and murder, from the community, thereby increasing public safety. However, use of IGG has preceded establishment of best practices. The 2021 TWG operational requirements identified the need for further development, assessment, and evaluation of IGG testing procedures for use by crime labs. This study will support the TWG requirements by assessing the ability of genotyping technologies to develop useful profiles from low-template and degraded sexual assault samples for genealogical searching in law-enforcement-accessible Direct-to-Consumer (DTC) genealogical databases and to support rapid, accurate, efficient identification of the samples’ source.
In Phase I, genotyping by Illumina Global Screening Array BeadChip, WGS on NovaSeq 6000, and targeted sequencing with Verogen ForenSeq Kintelligence Kit will be compared for sensitivity to low-level DNA input concentrations and specificity for artificially degraded DNA using whole semen and nascent semen DNA samples. The high-density SNP genotype profiles will be compared against databased genotypes in order to determine the maximum distance at which known or potential genealogical associations can be identified. In Phase II, the limitations will be further tested by generating a mock case scenario with laboratory-created challenging samples exhibiting both low-level concentration and DNA degradation utilizing a known donor for whom verified family members of known relationship distance greater than 5th degree (first cousin once removed) are present in DTC databases. After genotyping mock samples with methods most applicable to each sample’s particular characteristics, as determined by Phase I evaluations, a full genealogical investigative workflow conforming to the Genealogical Proof Standard will be applied to demonstrate whether or not increasingly distant relatives can be identified and at what distance identification is no longer possible.
Dissemination of the results of this study will provide the community with much needed systematic analyses and direct comparisons of available technologies and allow practitioners to make more informed decisions when working with limited resources. Results may assist in developing lab-specific criteria for processing irreplaceable DNA evidence samples with IGG and development of new, more efficient genealogical workflows. Study results will be disseminated to the forensic community through publications in peer-reviewed journals and presentations at scientific meetings.
https://doi.org/10.5061/dryad.g1jwstr04
Description of the Data and file structure
Tabular datasets are provided for each of the technologies evaluated. Compiled results are split into two datasets per technology: one dataset contains results from sensitivity studies and the second dataset contains results from artificially induced degradation studies.
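The two-datasets-per-technology layout follows a regular file naming pattern in the file list above. As a minimal sketch (the function name `sample_metrics_file` and the `data_dir` parameter are hypothetical, not part of the dataset), the expected file name for a given technology and study could be built like this:

```python
from pathlib import Path

# Prefix and name parts are taken from the file list of this dataset.
PREFIX = "15PNIJ-21-GG-04143-MUMU"
TECHNOLOGIES = ("GenomeSequencing", "GSAv2", "Kintelligence")
STUDIES = ("Sensitivity", "Degradation")


def sample_metrics_file(technology: str, study: str, data_dir: str = ".") -> Path:
    """Return the expected path of a SampleMetrics workbook (hypothetical helper)."""
    if technology not in TECHNOLOGIES:
        raise ValueError(f"unknown technology: {technology!r}")
    if study not in STUDIES:
        raise ValueError(f"unknown study: {study!r}")
    return Path(data_dir) / f"{PREFIX}_{technology}_SampleMetrics_{study}.xlsx"


if __name__ == "__main__":
    # List the six sample-metrics workbooks described in this section.
    for tech in TECHNOLOGIES:
        for study in STUDIES:
            print(sample_metrics_file(tech, study).name)
```

The resulting path can then be passed to any spreadsheet reader of choice.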
Metrics compiled for SENSITIVITY STUDY samples processed with Qiagen ForenSeq KINTELLIGENCE are contained in 15PNIJ-21-GG-04143-MUMU_Kintelligence_SampleMetrics_Sensitivity.xlsx and include the sequence data quality metrics and sample quality metrics as follows:
Sample Name: Given sample identification name, combination of donor code and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Assay tested.
Sample Type - Comments: Designates the origin of the DNA sample. POSITIVE = Known DNA reference extract aliquot supplied with the assay. NEGATIVE = Negative amplification control of water.
Sequencer: Type of sequencing instrument used to generate genome sequence data.
Quantifiler Trio Degradation Index (DI): The determined DI ratio of the small autosomal target to large autosomal target resulting from quantitative PCR using ThermoFisher's Quantifiler Trio Assay.
Total DNA input (ng): The total amount of DNA added to the Qiagen ForenSeq Kintelligence assay for processing.
DNA Library quant (ng/µl): The determined dsDNA concentration of the final clean Kintelligence library generated by each sample.
Total Raw Reads (samtools): The total number of identified reads for the given sample barcode after Illumina MiSeq sequencing and demultiplexing, calculated using samtools flagstat.
Total Mapped Reads (samtools): The total number of demultiplexed reads for a given sample barcode mapped to the GRCh38 human genome reference, calculated using samtools flagstat.
Mapped and Filtered Library Reads (UAS): The final number of demultiplexed and mapped reads after quality filtering, calculated by Qiagen Universal Analysis Software (UAS).
Percent Reads Mapped & Filtered: The percentage of Mapped and Filtered Library Reads out of Total Raw Reads for the sample.
Average Locus Read Depth: The calculated average locus read depth across all observed target amplified SNP loci in the sample.
Standard Deviation Locus Read Depth: The associated standard deviation calculated for the average locus read depth of the sample.
Total SNPs typed (20X AT): The total number of target loci observed out of 10230 possible targets when evaluating the sequence results at a minimum locus read depth of 20X coverage, or a 3% analytical threshold as set by the UAS.
Call Rate (20X AT): The ratio of total loci observed vs total loci possible at a minimum locus read depth of 20X coverage, given as a percentage.
Total SNPs typed (10X AT): The total number of target loci observed out of 10230 possible targets when evaluating the sequence results at a minimum locus read depth of 10X coverage, or a 1.5% analytical threshold as set by the UAS.
Call Rate (10X AT): The ratio of total loci observed vs total loci possible at a minimum locus read depth of 10X coverage, given as a percentage.
No Call (10X AT): The total number of target SNP loci that resulted in no genotype information when analyzed at a 10X coverage threshold.
Autosomal Intralocus Balance (10X AT): The average allele balance between alleles within a heterozygous locus, calculated by UAS.
Heterozygous SNPs (10X AT): The total number of target loci generating heterozygous genotypes in the sample.
Heterozygosity Rate (10X AT): Ratio of heterozygous loci vs all sequenced loci in the sample, given as a percent, calculated by UAS.
X-SNPs at 10X AT: The total number of observed loci categorized as X chromosome SNPs at a 10X coverage threshold.
X-SNPs Not Detected (10X AT): The number of SNPs categorized as X-SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Y-SNPs at 10X AT: The total number of observed loci categorized as Y chromosome SNPs at a 10X coverage threshold.
Y-SNPs Not Detected (10X AT): The number of SNPs categorized as Y-SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Ancestry SNPs at 10X AT: The total number of observed loci categorized as ancestry informative SNPs at a 10X coverage threshold.
Ancestry SNPs Not Detected (10X AT): The number of SNPs categorized as ancestry informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Phenotype SNPs at 10X AT: The total number of observed loci categorized as phenotype informative SNPs at a 10X coverage threshold.
Phenotype SNPs Not Detected (10X AT): The number of SNPs categorized as phenotype informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Identity SNPs at 10X AT: The total number of observed loci categorized as identity informative SNPs at a 10X coverage threshold.
Identity SNPs Not Detected (10X AT): The number of SNPs categorized as identity informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Kinship SNPs at 10X AT: The total number of observed loci categorized as kinship informative SNPs at a 10X coverage threshold.
Kinship SNPs Not Detected (10X AT): The number of SNPs categorized as kinship informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
<150 bp Loci Typed: The total number of target loci with amplicons less than 150 bp that were observed and genotyped in the sequence data at a 10X coverage threshold.
<150 bp Loci Not Typed: The number of target loci with amplicons less than 150 bp that were not observed and genotyped in the sequence data at a 10X coverage threshold.
\>=150 bp Loci Typed: The total number of target loci with amplicons greater than or equal to 150 bp that were observed and genotyped in the sequence data at a 10X coverage threshold.
\>=150 bp Loci Not Typed: The number of target loci with amplicons greater than or equal to 150 bp that were not observed and genotyped in the sequence data at a 10X coverage threshold.
Concordant Calls\_ 1ng REF_KINT: Total number of concordant genotype calls compared to a 1ng input reference Kintelligence genotype for the corresponding donor.
Kintelligence Percent Concordant Calls: The percent of the total number of genotypes observed for the sample that are concordant with the 1ng reference sample.
Kintelligence Discordance - Discordant Calls: Total number of discordant genotype calls compared to a 1ng input reference for the corresponding donor.
Kintelligence Discordance - Percent Discordant calls: The percent of the total number of genotypes observed for the sample that are discordant with the 1ng reference sample.
Kintelligence Discordance Locus Recovery - Genotype discordance due to genotype failed in 1 ng Ref: Total number of loci for a given sample that generated a genotype call but the corresponding locus in the reference sample did not generate a genotype call.
Kintelligence Discordance - Number Discordant Genotypes Due to False Homozygous Call: Total number of loci that produced a homozygous call in the sample where a heterozygous call was observed in the reference sample.
Kintelligence Discordance - Number Discordant Genotypes Due to False Heterozygous Call: Total number of loci that produced a heterozygous call in the sample where a homozygous call was observed in the reference sample.
Kintelligence Discordance - Number Discordant Genotypes Due to False Genotype: The total number of discordant called genotypes that demonstrated discordant homozygous allele calls, e.g., Kintelligence reference demonstrates genotypes as 0/0 while sample demonstrates genotype as 1/1.
[Note: The following metrics categories in this data set will describe cross-technology comparisons between a reference genotype generated via Illumina GSA and the sensitivity sample processed with Kintelligence]:
GSA-consistent SNPs called in Kintelligence Sample (9695 maximum): Out of a maximum of 9695 loci available for cross-technology comparisons (GSA vs Kintelligence), the number of SNPs producing a genotype call in the sensitivity sample Kintelligence data.
GSA-consistent SNPs with NO CALL in Kintelligence Data: The total number of consistent SNP loci targeted by both technologies but with no observed genotype result in the Kintelligence sample data.
Number Concordant SNPs Called in GSA and Kintelligence Data: The total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both Kintelligence and the GSA 200 ng input reference sample.
Percent Concordant SNPs Called in GSA and Kintelligence Data: The percent of the total number of observed SNPs available for cross-technology comparison that produced concordant genotype calls in both Kintelligence and the GSA 200 ng input reference sample.
GSA Discordance - Number SNP Genotype Differences Including NO CALLS: For comparative SNPs, the total number of loci determined to have discordant genotype calls, inclusive of loci that exhibited a NO CALL in either the GSA reference data or the Kintelligence sample data.
GSA Discordance - Number Discordant genotypes CALLED in Both Technologies: The total number of comparative loci with CALLED genotypes in both technologies that exhibited a discordant genotype.
GSA Discordance - Percent Discordant Genotypes CALLED in Both Technologies: The percent of the total loci with called genotypes that demonstrated discordant genotypes.
GSA Discordance - Number Discordant genotypes due to false Homozygous Call: The total number of discordant called genotypes that demonstrated a heterozygous genotype call in the GSA reference data but a homozygous genotype call in the Kintelligence data.
GSA Discordance - Number Discordant genotypes due to false Heterozygous Call: The total number of discordant called genotypes that demonstrated a homozygous genotype call in the GSA reference data but a heterozygous genotype call in the Kintelligence data.
GSA Discordance - Number Discordant genotypes due to false genotype: The total number of discordant called genotypes that demonstrated discordant homozygous allele calls, e.g., GSA demonstrates genotypes as 0/0 while Kintelligence demonstrates genotype as 1/1.
GSA Discordance - Number Discordant genotypes due to NO CALL in GSA: Total number of comparative SNPs that did not generate a genotype result with GSA but produced a genotype with Kintelligence assay.
GEDmatch PRO Upload - Total Useable SNPs: Total number of available SNPs in the Kintelligence sample data determined "usable" by GEDmatch PRO during database upload.
GEDmatch PRO Upload - Total SNPs after Slimming: Total number of SNPs assigned to a GEDmatch PRO kit ID after the database's proprietary "slimming" algorithm is applied.
GEDmatch PRO Upload - Percent SNPs used for matching: Percent of total SNPs remaining in kitID genotype after slimming.
GEDmatch PRO Upload - Total Matches: Total number of database matches determined for each uploaded Kintelligence profile.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
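The percentage metrics above are simple ratios of other columns. The following is a minimal sketch of that arithmetic, not the UAS implementation; the function names are hypothetical, and the 10230-target maximum is the figure stated in the column descriptions.

```python
# Maximum number of Kintelligence target loci, as stated in the descriptions above.
KINTELLIGENCE_TARGETS = 10230


def call_rate(snps_typed: int, total_targets: int = KINTELLIGENCE_TARGETS) -> float:
    """Loci observed divided by loci possible, as a percentage."""
    return 100.0 * snps_typed / total_targets


def no_call_count(snps_typed: int, total_targets: int = KINTELLIGENCE_TARGETS) -> int:
    """Target loci without a genotype at the chosen coverage threshold."""
    return total_targets - snps_typed


def percent_mapped_filtered(mapped_filtered_reads: int, total_raw_reads: int) -> float:
    """Mapped and filtered reads as a percentage of total raw reads."""
    if total_raw_reads == 0:
        return 0.0  # assumption: report 0 for a sample with no reads
    return 100.0 * mapped_filtered_reads / total_raw_reads


if __name__ == "__main__":
    typed = 5115  # illustrative value only, not taken from the dataset
    print(f"call rate: {call_rate(typed):.1f}%")
    print(f"no calls:  {no_call_count(typed)}")
```

The same `call_rate` function applies at either analytical threshold; only the count of typed SNPs changes between the 10X and 20X columns.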
Metrics compiled for ARTIFICIAL DEGRADATION STUDY samples processed with Qiagen ForenSeq KINTELLIGENCE are contained in 15PNIJ-21-GG-04143-MUMU_Kintelligence_SampleMetrics_Degradation.xlsx and include the sequence data quality metrics and sample quality metrics as follows:
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Assay tested.
Study: Designates if the sample is an artificially degraded test sample or an undegraded, high quality reference DNA extract used for concordance comparisons. Study = None designates the associated amplification positive and negative controls.
Sample Type - Comments: Designates the origin of the DNA sample. POSITIVE = Known DNA reference extract aliquot supplied with the assay. NEGATIVE = Negative amplification control of water.
Sequencer: Type of sequencing instrument used to generate genome sequence data.
Degradation Method: Designates the method of artificial degradation applied to the sample.
Exposure Time Point: Categorical designation of exposure time: 1 = shortest exposure, 4 = longest exposure.
Exposure Time Length: Specifies the numerical length of exposure time.
Quantifiler Trio Degradation Index (DI): The determined DI ratio of the small autosomal target to large autosomal target resulting from quantitative PCR using ThermoFisher's Quantifiler Trio Assay.
Genomic DIN: Measure of genomic DNA fragmentation as measured by Agilent TapeStation gel electrophoresis, DIN = DNA Integrity Number.
STR Percent Profile: The percent of STR alleles obtained out of the total expected STR alleles for the sample donor.
Average Profile Peak Height: Average peak height for alleles across all amplified STR loci in relative fluorescence units (RFU).
Profile Balance: Comparison between the highest STR loci allele peak heights and the lowest STR loci allele peak heights. N/A indicates no balance obtained due to locus dropout across the profile.
FI Value: Calculated Forensic Index (FI) value for the sample.
Total DNA input (ng): The total amount of DNA added to the Qiagen ForenSeq Kintelligence assay for processing.
DNA Library quant (ng/µl): The determined dsDNA concentration of the final clean Kintelligence library generated by each sample.
Total Raw Reads (samtools): The total number of identified reads for the given sample barcode after Illumina MiSeq sequencing and demultiplexing, calculated using samtools flagstat.
Total Mapped Reads (samtools): The total number of demultiplexed reads for a given sample barcode mapped to the GRCh38 human genome reference, calculated using samtools flagstat.
Mapped and Filtered Library Reads (UAS): The final number of demultiplexed and mapped reads after quality filtering, calculated by Qiagen Universal Analysis Software (UAS).
Percent Reads Mapped & Filtered: The percentage of Mapped and Filtered Library Reads out of Total Raw Reads for the sample.
Average Locus Read Depth: The calculated average locus read depth across all observed target amplified SNP loci in the sample.
Standard Deviation Locus Read Depth: The associated standard deviation calculated for the average locus read depth of the sample.
Total SNPs typed (10X AT): The total number of target loci observed out of 10230 possible targets when evaluating the sequence results at a minimum locus read depth of 10X coverage, or a 1.5% analytical threshold as set by the UAS.
Call Rate (10X AT): The ratio of total loci observed vs total loci possible at a minimum locus read depth of 10X coverage, given as a percentage.
No Call (10X AT): The total number of target SNP loci that resulted in no genotype information when analyzed at a 10X coverage threshold.
Autosomal Intralocus Balance (10X AT): The average allele balance between alleles within a heterozygous locus, calculated by UAS.
Heterozygous SNPs (10X AT): The total number of target loci generating heterozygous genotypes in the sample.
Heterozygosity Rate (10X AT): Ratio of heterozygous loci vs all sequenced loci in the sample, given as a percent, calculated by UAS.
X-SNPs at 10X AT: The total number of observed loci categorized as X chromosome SNPs at a 10X coverage threshold.
X-SNPs Not Detected (10X AT): The number of SNPs categorized as X-SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Y-SNPs at 10X AT: The total number of observed loci categorized as Y chromosome SNPs at a 10X coverage threshold.
Y-SNPs Not Detected (10X AT): The number of SNPs categorized as Y-SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Ancestry SNPs at 10X AT: The total number of observed loci categorized as ancestry informative SNPs at a 10X coverage threshold.
Ancestry SNPs Not Detected (10X AT): The number of SNPs categorized as ancestry informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Phenotype SNPs at 10X AT: The total number of observed loci categorized as phenotype informative SNPs at a 10X coverage threshold.
Phenotype SNPs Not Detected (10X AT): The number of SNPs categorized as phenotype informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Identity SNPs at 10X AT: The total number of observed loci categorized as identity informative SNPs at a 10X coverage threshold.
Identity SNPs Not Detected (10X AT): The number of SNPs categorized as identity informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
Kinship SNPs at 10X AT: The total number of observed loci categorized as kinship informative SNPs at a 10X coverage threshold.
Kinship SNPs Not Detected (10X AT): The number of SNPs categorized as kinship informative SNPs expected but not observed in the sequence results for a sample at a 10X coverage threshold.
<150 bp Loci Typed: The total number of target loci with amplicons less than 150 bp that were observed and genotyped in the sequence data at a 10X coverage threshold.
<150 bp Loci Not Typed: The number of target loci with amplicons less than 150 bp that were not observed and genotyped in the sequence data at a 10X coverage threshold.
\>=150 bp Loci Typed: The total number of target loci with amplicons greater than or equal to 150 bp that were observed and genotyped in the sequence data at a 10X coverage threshold.
\>=150 bp Loci Not Typed: The number of target loci with amplicons greater than or equal to 150 bp that were not observed and genotyped in the sequence data at a 10X coverage threshold.
Kintelligence Concordant Calls: The number of concordant genotype calls compared to a 1ng input reference Kintelligence genotype for the corresponding donor.
Kintelligence Percent Concordant Calls: The percent of the total number of genotypes observed for the sample that are concordant with the 1ng reference sample.
Kintelligence Discordance - Discordant Calls: The number of discordant genotype calls compared to a 1ng input reference for the corresponding donor.
Kintelligence Discordance - Percent Discordant calls: The percent of the total number of genotypes observed for the sample that are discordant with the 1ng reference sample.
Kintelligence Discordance Locus Recovery - Genotype discordance due to genotype failed in 1 ng Ref: Total number of loci for a given sample that generated a genotype call but the corresponding locus in the reference sample did not generate a genotype call.
Kintelligence Discordance - Number Discordant Genotypes Due to False Homozygous Call: The number of loci that produced a homozygous call in the sample where a heterozygous call was observed in the reference sample.
Kintelligence Discordance - Number Discordant Genotypes Due to False Heterozygous Call: The number of loci that produced a heterozygous call in the sample where a homozygous call was observed in the reference sample.
Kintelligence Discordance - Number Discordant Genotypes Due to False Genotype: The number of discordant called genotypes that demonstrated discordant homozygous allele calls, e.g., Kintelligence reference demonstrates genotypes as 0/0 while sample demonstrates genotype as 1/1.
[Note: The following metrics categories in this data set will describe cross-technology comparisons between a reference genotype generated via Illumina GSA and the artificial degradation samples processed with Kintelligence]:
GSA-consistent SNPs called in Kintelligence Sample (9695 maximum): Out of a maximum of 9695 loci available for cross-technology comparisons (GSA vs Kintelligence), the number of SNPs producing a genotype call in the artificial degradation sample Kintelligence data.
GSA-consistent SNPs with NO CALL in Kintelligence Data: The total number of consistent SNP loci targeted by both technologies but with no observed genotype result in the Kintelligence sample data.
Number Concordant SNPs Called in GSA and Kintelligence Data: The total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both Kintelligence and the GSA 200 ng input reference sample.
Percent Concordant SNPs Called in GSA and Kintelligence Data: The percent of the total number of observed SNPs available for cross-technology comparison that produced concordant genotype calls in both Kintelligence and the GSA 200 ng input reference sample.
GSA Discordance - Number SNP Genotype Differences Including NO CALLS: For comparative SNPs, the total number of loci determined to have discordant genotype calls, inclusive of loci that exhibited a NO CALL in either the GSA reference data or the Kintelligence sample data.
GSA Discordance - Number Discordant genotypes CALLED in Both Technologies: The total number of comparative loci with CALLED genotypes in both technologies that exhibited a discordant genotype.
GSA Discordance - Percent Discordant Genotypes CALLED in Both Technologies: The percent of the total loci with called genotypes that demonstrated discordant genotypes.
GSA Discordance - Number Discordant genotypes due to false Homozygous Call: The total number of discordant called genotypes that demonstrated a heterozygous genotype call in the GSA reference data but a homozygous genotype call in the Kintelligence data.
GSA Discordance - Number Discordant genotypes due to false Heterozygous Call: The total number of discordant called genotypes that demonstrated a homozygous genotype call in the GSA reference data but a heterozygous genotype call in the Kintelligence data.
GSA Discordance - Number Discordant genotypes due to false genotype: The total number of discordant called genotypes that demonstrated discordant homozygous allele calls, e.g., GSA demonstrates genotypes as 0/0 while Kintelligence demonstrates genotype as 1/1.
GSA Discordance - Number Discordant genotypes due to NO CALL in GSA: Total number of comparative SNPs that did not generate a genotype result with GSA but produced a genotype with Kintelligence assay.
GEDmatch PRO Upload - Total Useable SNPs: Total number of available SNPs in the Kintelligence sample data determined "usable" by GEDmatch PRO during database upload.
GEDmatch PRO Upload - Total SNPs after Slimming: Total number of SNPs assigned to a GEDmatch PRO kit ID after the database's proprietary "slimming" algorithm is applied.
GEDmatch PRO Upload - Percent SNPs used for matching: Percent of total SNPs remaining in kitID genotype after slimming.
GEDmatch PRO Upload - Total Matches: Total number of database matches determined for each uploaded Kintelligence profile.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
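The concordance and discordance categories above can be illustrated with a short classification routine. This is a sketch under stated assumptions, not the study's actual pipeline: genotypes are represented as pairs of allele codes (0 = reference, 1 = alternate), a no call is `None`, loci are assumed biallelic, and the locus names are hypothetical. The percent-concordant denominator (genotypes called in the sample) is one reading of the column description.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Genotype = Optional[Tuple[int, int]]  # None stands for a no call


def classify(sample_gt: Genotype, ref_gt: Genotype) -> str:
    """Assign one sample/reference genotype pair to a comparison category."""
    if sample_gt is None:
        return "sample_no_call"
    if ref_gt is None:
        return "ref_no_call"  # called in the sample but failed in the reference
    if sorted(sample_gt) == sorted(ref_gt):
        return "concordant"  # allele order is ignored, so 0/1 equals 1/0
    sample_het = sample_gt[0] != sample_gt[1]
    ref_het = ref_gt[0] != ref_gt[1]
    if not sample_het and ref_het:
        return "false_homozygous"
    if sample_het and not ref_het:
        return "false_heterozygous"
    return "false_genotype"  # e.g. 0/0 in the reference vs 1/1 in the sample


def summarize(sample: Dict[str, Genotype], reference: Dict[str, Genotype]) -> Dict[str, float]:
    """Count categories over all reference loci and derive percent concordant."""
    counts = Counter(classify(sample.get(locus), gt) for locus, gt in reference.items())
    called = sum(n for cat, n in counts.items() if cat != "sample_no_call")
    result: Dict[str, float] = dict(counts)
    # Assumption: the denominator is the number of genotypes called in the sample.
    result["percent_concordant"] = 100.0 * counts["concordant"] / called if called else 0.0
    return result


if __name__ == "__main__":
    reference = {"locus_a": (0, 1), "locus_b": (0, 0), "locus_c": (1, 1)}
    sample = {"locus_a": (0, 0), "locus_b": (0, 0), "locus_c": None}
    print(summarize(sample, reference))
```

Keeping the no-call categories separate from the called-but-discordant categories mirrors how the columns above report no-call-driven differences apart from genuine genotype disagreements.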
Metrics compiled for SENSITIVITY STUDY samples processed with Illumina GENOME SEQUENCING are contained in 15PNIJ-21-GG-04143-MUMU_GenomeSequencing_SampleMetrics_Sensitivity.xlsx and include sequence data quality metrics and sample quality metrics as follows:
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Sample Type - Comments: Designates the origin of the DNA sample.
Sequencer: Type of sequencing instrument used to generate genome sequence data.
Quantifiler Trio Degradation Index (DI): The determined DI ratio of the small autosomal target to large autosomal target resulting from quantitative PCR using ThermoFisher's Quantifiler Trio Assay.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
Total Raw Reads (samtools): The total number of identified reads for the given sample barcode after Illumina NovaSeq 6000 sequencing and demultiplexing, calculated using samtools flagstat.
Total Mapped Reads (samtools): The total number of demultiplexed reads for a given sample barcode mapped to the GRCh38 human genome reference, calculated using samtools flagstat.
Total Mapped %: The percentage of Total Mapped Reads out of Total Raw Reads for the sample.
Average Read Length (bp): Ratio between total length and total sequences.
Average Insert Size (bp): The average absolute template length for paired and mapped reads.
Insert Length STD: Standard deviation for the average template length distribution.
Average Quality: Ratio between the sum of base qualities and total length.
Marked Duplicates Read Count: Number of sequencing reads marked as duplicates.
Percent Marked Duplicates: Percent of total sequencing reads marked as duplicates.
Average Autosomal Coverage: The number of bases that aligned to autosomes divided by the total number of bases in the autosomes.
% of Genome with Coverage ≥1X: Percent of total number of bases in the genome covered by at least 1 sequencing read.
% of Genome with Coverage ≥10X: Percent of total number of bases in the genome covered by at least 10 sequencing reads.
% of Genome with Coverage ≥20X: Percent of total number of bases in the genome covered by at least 20 sequencing reads.
Total IGG SNPs: Total number of target SNP loci observed in the sequence data.
IGG SNPs Call Rate (2,061,275 Maximum): Percent of observed SNPs out of a possible maximum of 2,061,275 SNPs.
WGS Concordant SNPs: The number of concordant genotype calls compared to a 50 ng input reference genome sequencing genotype for the corresponding donor.
WGS Concordance Rate: The percent of the total number of genotypes observed for the sample that are concordant with the 50 ng reference sample.
WGS Discordance - Number Genotype Differences from 50 ng REF: The number of discordant genotype calls compared to a 50 ng input reference for the corresponding donor.
WGS Discordance - Discordance rate: The percent of the total number of genotypes observed for the sample that are discordant with the 50 ng reference sample.
WGS Discordance - Number Discordant genotypes due to false genotype: The number of discordant called genotypes that demonstrated discordant allele calls not explained as allele loss or allele drop-in, e.g., 0/0 vs 1/1 or 0/1 vs 0/2.
WGS Discordance - Number Discordant genotypes due to false Heterozygous Call: The number of discordant called genotypes that demonstrated a homozygous genotype call in the genome sequence 50 ng reference data but a heterozygous genotype call in the sample data.
WGS Discordance - Number Discordant genotypes due to false Homozygous Call: The number of discordant called genotypes that demonstrated a heterozygous genotype call in the genome sequence 50 ng reference data but a homozygous genotype call in the sample data.
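The coverage columns in this dataset (Average Autosomal Coverage and % of Genome with Coverage ≥1X, ≥10X, ≥20X) can be illustrated from a per-base depth vector. This is a toy sketch assuming depth values are already available as a list of integers, one per base; the function names are hypothetical and the study's own tooling is not specified here.

```python
from typing import Sequence


def average_coverage(depths: Sequence[int]) -> float:
    """Aligned bases divided by the number of positions (mean per-base depth)."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(depths) / len(depths)


def percent_covered(depths: Sequence[int], min_depth: int) -> float:
    """Percent of positions covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


if __name__ == "__main__":
    toy_depths = [0, 1, 10, 20, 29]  # illustrative values only
    print(f"average coverage: {average_coverage(toy_depths):.1f}X")
    for threshold in (1, 10, 20):
        print(f">= {threshold}X: {percent_covered(toy_depths, threshold):.0f}%")
```

For a real genome the depth vector would be far too large to hold as a Python list; the sketch only shows how the two kinds of metric relate to the same underlying depths.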
[Note: The following metrics categories in this data set will describe cross-technology comparisons between a reference genotype generated via Illumina GSA and the sensitivity samples processed via Illumina NovaSeq 6000]:
Total GSA SNPs Observed in WGS Data: The number of SNPs found on both the Illumina GSA BeadChip and in the genome sequencing data for the corresponding sample.
GSA SNP call rate in WGS Data: Percent of observed SNPs in genome sequencing data out of a possible maximum of 618,564 SNPs consistent between technologies.
Number Concordant SNPs Called in GSA and WGS Data: The total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence sensitivity sample data and the GSA 200 ng input reference.
Percent Concordant SNPs Called in GSA and WGS Data: Percent of the total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence sensitivity sample data and the GSA 200 ng input reference.
GSA Concordant SNPs plus OppStrandReport Differences: The number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence sensitivity sample data and the GSA 200 ng input reference, including SNPs with genotypes likely reported on the reverse strand in the GSA reference data thus not a true difference.
GSA Discordance - Number SNP Genotype Differences Including NO CALLS: The total number of SNPs available for cross-technology comparison that produced discordant genotype calls between the genome sequence sensitivity sample data and the GSA 200 ng input reference, inclusive of NO CALL results in the GSA reference data.
GSA Discordance - Percent Total SNP Genotypes Discordant to GSA: Percent of the total number of SNPs available for cross-technology comparison that produced discordant genotyping results in both genome sequence sensitivity sample data and the GSA 200 ng input reference inclusive of a NO CALL result in GSAv2 reference data.
GSA Discordance - Number Discordant genotype, likely OppStrandReport: The number of designated discordant SNP genotypes likely due to opposite strand reporting in the GSA reference data.
GSA Discordance - Number SNP Genotype Differences minus OppStrandReport Differences: The number of SNPs available for cross-technology comparison that produced discordant genotype calls in both genome sequence sensitivity sample data and the GSA 200 ng input reference, excluding SNPs with genotypes likely reported on the reverse strand in the GSA reference data.
GSA Discordance - Number Discordant genotypes due to NO CALL in GSA: The number of comparative SNPs that did not generate a genotype result in the GSA reference sample but produced a genotype call in the genome sequencing sensitivity sample.
GSA Discordance - Number Discordant genotypes CALLED in Both Technologies: The total number of loci CALLED in both the genome sequence sensitivity sample data and the GSA 200 ng input reference resulting in a discordant genotype for the corresponding sample.
GSA Discordance - Percent Discordant Genotypes CALLED in Both Technologies: Percent of the total number of SNPs available for cross-technology comparison that produced discordant genotype CALLS in both genome sequence sensitivity sample data and the GSA 200 ng input reference.
GSA Discordance - Number Discordant genotypes due to false genotype: The number of discordant called genotypes that demonstrated discordant allele calls not explained as allele loss or allele drop-in, e.g., 0/0 vs 1/1 or 0/1 vs 0/2.
GSA Discordance - Number Discordant genotypes due to false Heterozygous Call: The number of discordant called genotypes that demonstrated a homozygous genotype call in the GSA 200 ng reference data but a heterozygous genotype call in the sample data.
GSA Discordance - Number Discordant genotypes due to false Homozygous Call: The number of discordant called genotypes that demonstrated a heterozygous genotype call in the GSA 200 ng reference data but a homozygous genotype call in the sample data.
Heterozygosity of GSA consistent SNP genotypes: Ratio of heterozygous genotypes observed in genome sequencing data at loci consistent with SNP loci on the GSA BeadChip to total number of observed genotypes at loci consistent with SNP loci on the GSA BeadChip.
GEDmatch PRO Upload - Total Useable SNPs: Total number of available SNPs in the genome sequencing sensitivity sample data determined "usable" by GEDmatch PRO during database upload.
GEDmatch PRO Upload - Total SNPs after Slimming: Total number of SNPs assigned to a GEDmatch PRO kit ID after the database's proprietary "slimming" algorithm is applied.
GEDmatch PRO Upload - Percent SNPs used for matching: Percent of total SNPs remaining in kitID genotype after slimming.
GEDmatch PRO Upload - Total Matches: Total number of database matches determined for each uploaded genome sequencing sensitivity sample profile.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
“null” indicates missing data in one cell of column AW due to removal of the corresponding sample from GEDmatch before the metric was collected.
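The three called-genotype discordance categories defined above (false genotype, false heterozygous call, false homozygous call) can be illustrated with a short sketch. It assumes VCF-style genotype strings such as 0/1, treats any genotype containing "." as a no call, and reads "allele drop-in" and "allele loss" as the sample gaining or losing one allele relative to the reference. The function names and this interpretation are illustrative assumptions, not the procedure used to build the spreadsheets.

```python
from collections import Counter


def alleles(gt):
    """Return the set of alleles in a VCF-style genotype such as '0/1' or '0|1'."""
    return set(gt.replace("|", "/").split("/"))


def classify(ref_gt, sample_gt):
    """Classify one sample genotype against the reference genotype (hypothetical helper)."""
    if "." in ref_gt or "." in sample_gt:
        return "no_call"
    ref, smp = alleles(ref_gt), alleles(sample_gt)
    if ref == smp:
        return "concordant"
    # Homozygous reference, heterozygous sample that keeps the reference allele:
    # consistent with allele drop-in.
    if len(ref) == 1 and len(smp) == 2 and ref < smp:
        return "false_heterozygous"
    # Heterozygous reference, sample homozygous for one of its alleles:
    # consistent with allele loss.
    if len(ref) == 2 and len(smp) == 1 and smp < ref:
        return "false_homozygous"
    # Everything else, e.g. 0/0 vs 1/1 or 0/1 vs 0/2.
    return "false_genotype"


def tally(pairs):
    """Count categories over (reference, sample) genotype pairs."""
    return Counter(classify(ref, smp) for ref, smp in pairs)


pairs = [("0/0", "1/1"), ("0/1", "0/2"), ("0/0", "0/1"), ("0/1", "1/1"), ("0/1", "1/0")]
print(tally(pairs))
```

Comparing allele sets rather than strings keeps allele order (0/1 vs 1/0) from being counted as a difference.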
Metrics compiled for ARTIFICIAL DEGRADATION STUDY samples processed with Illumina GENOME SEQUENCING are contained in 15PNIJ-21-GG-04143-MUMU_GenomeSequencing_SampleMetrics_Degradation.xlsx and include sequence data quality metrics and sample quality metrics as follows:
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Study: Designates if the sample is an artificially degraded test sample or an undegraded, high quality reference DNA extract used for concordance comparisons.
Sample Type - Comments: Designates the origin of the DNA sample.
Sequencer: Type of sequencing instrument used to generate genome sequence data.
Degradation Method: Designates the method of artificial degradation applied to the sample.
Exposure Time Point: Categorical designation of exposure time: 1 = shortest exposure, 4 = longest exposure.
Exposure Time Length: Specifies the numerical length of exposure time.
Quantifiler Trio Degradation Index (DI): The determined DI ratio of the small autosomal target to large autosomal target resulting from quantitative PCR using ThermoFisher's Quantifiler Trio Assay.
Genomic DIN: Measure of genomic DNA fragmentation as measured by Agilent TapeStation gel electrophoresis, DIN = DNA Integrity Number.
STR Percent Profile: The percent of STR alleles obtained out of the total expected STR alleles for the sample donor.
Average Profile Peak Height: Average peak height for alleles across all amplified STR loci in relative fluorescence units (RFU).
Profile Balance: Comparison between the highest STR loci allele peak heights and the lowest STR loci allele peak heights. N/A indicates no balance obtained due to locus dropout across the profile.
FI Value: Calculated Forensic Index (FI) value for the sample.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
Total Raw Reads (samtools): The total number of identified reads for the given sample barcode after Illumina NovaSeq 6000 sequencing and demultiplexing, calculated using samtools flagstat.
Total Mapped Reads (samtools): The total number of demultiplexed reads for a given sample barcode mapped to the GRCh38 human genome reference, calculated using samtools flagstat.
Total Mapped %: The percentage of Mapped and Filtered Library Reads out of Total Raw Reads for the sample.
Average Read Length (bp): Ratio between total length and total sequences.
Average Insert Size (bp): The average absolute template length for paired and mapped reads.
Insert Length STD: Standard deviation for the average template length distribution.
Average Quality: Ratio between the sum of base qualities and total length.
Marked Duplicates Read Count: Number of sequencing reads marked as duplicates.
Percent Marked Duplicates: Percent of total sequencing reads marked as duplicates.
Estimated sample contamination: The estimated fraction of reads in a sample that may be from another human source.
Average Autosomal Coverage: The number of bases that aligned to autosomes divided by the total number of bases in the autosomes.
% of Genome with Coverage ≥1X: Percent of total number of bases in the genome covered by at least 1 sequencing read.
% of Genome with Coverage ≥10X: Percent of total number of bases in the genome covered by at least 10 sequencing reads.
% of Genome with Coverage ≥20X: Percent of total number of bases in the genome covered by at least 20 sequencing reads.
Total IGG SNPs: Total number of target SNP loci observed in the sequence data.
IGG SNPs Call Rate (2,061,275 Maximum): Percent of observed SNPs out of a possible maximum of 2,061,275 SNPs.
WGS Concordant SNPs: The number of concordant genotype calls compared to a 50 ng input reference genome sequencing genotype for the corresponding donor.
WGS Concordance Rate: The percent of the total number of genotypes observed for sample demonstrating concordant genotype results to the 50 ng reference sample.
WGS Discordance - Number Genotype Differences from 50 ng REF: The number of discordant genotype calls compared to a 50 ng input reference for the corresponding donor.
WGS Discordance - Discordance rate: The percent of the total number of genotypes observed for sample demonstrating discordant genotype results compared to the 50 ng reference sample.
WGS Discordance - Number Discordant genotypes due to false genotype: The number of discordant called genotypes that demonstrated discordant allele calls not explained as allele loss or allele drop-in, e.g., 0/0 vs 1/1 or 0/1 vs 0/2.
WGS Discordance - Number Discordant genotypes due to false Heterozygous Call: The number of discordant called genotypes that demonstrated a homozygous genotype call in the genome sequence 50 ng reference data but a heterozygous genotype call in the sample data.
WGS Discordance - Number Discordant genotypes due to false Homozygous Call: The number of discordant called genotypes that demonstrated a heterozygous genotype call in the genome sequence 50 ng reference data but a homozygous genotype call in the sample data.
[Note: The following metrics categories in this data set will describe cross-technology comparisons between a reference genotype generated via Illumina GSA and the artificial degradation samples processed via Illumina NovaSeq 6000]:
Total GSA SNPs Observed in WGS Data: The number of SNPs found on both the Illumina GSA BeadChip and in the genome sequencing data for the sample.
GSA SNP call rate in WGS Data: Percent of observed SNPs in genome sequencing data out of a possible maximum of 618,564 SNPs consistent between technologies.
Number Concordant SNPs Called in GSA and WGS Data: The total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence artificial degradation sample data and the GSA 200 ng input reference.
Percent Concordant SNPs Called in GSA and WGS Data: Percent of the total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence artificial degradation sample data and the GSA 200 ng input reference.
GSA Concordant SNPs plus OppStrandReport Differences: The number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence artificial degradation sample data and the GSA 200 ng input reference, including SNPs with genotypes likely reported on the reverse strand in the GSA reference data.
Percent Concordant to GSA calls +OppStrandReport: Percent of total number of SNPs available for cross-technology comparison that produced concordant genotype calls in both genome sequence artificial degradation sample data and the GSA 200 ng input reference, including SNPs with genotypes likely reported on the reverse strand in the GSA reference data thus not a true difference.
GSA Discordance - Number SNP Genotype Differences Including NO CALLS: The total number of SNPs available for cross-technology comparison that produced discordant genotype calls between the genome sequence artificial degradation sample data and the GSA 200 ng input reference, inclusive of NO CALL results in the GSA reference data.
GSA Discordance - Percent Total SNP Genotypes Discordant to GSA: Percent of the total number of SNPs available for cross-technology comparison that produced discordant genotyping results between the genome sequence artificial degradation sample data and the GSA 200 ng input reference, inclusive of a NO CALL result in GSAv2 reference data.
GSA Discordance - Number Discordant genotype, likely OppStrandReport: The number of designated discordant SNP genotypes likely due to opposite strand reporting in the GSA reference data.
GSA Discordance - Number SNP Genotype Differences minus OppStrandReport Differences: The number of SNPs available for cross-technology comparison that produced discordant genotype calls in both genome sequence artificial degradation sample data and the GSA 200 ng input reference, excluding SNPs with genotypes likely reported on the reverse strand in the GSA reference data.
GSA Discordance - Number Discordant genotype due to NO CALL in GSA: The number of comparative SNPs that did not generate a genotype result in the GSA reference sample but produced a genotype in the genome sequencing artificial degradation sample.
GSA Discordance - Number Discordant genotypes CALLED in Both Technologies: The total number of loci CALLED in both the genome sequence artificial degradation sample data and the GSA 200 ng input reference resulting in a discordant genotype for the corresponding sample.
GSA Discordance - Percent Discordant Genotypes CALLED in Both Technologies: Percent of the total number of SNPs available for cross-technology comparison that produced discordant genotype CALLS in both genome sequence artificial degradation sample data and the GSA 200 ng input reference.
GSA Discordance - Number Discordant genotype due to false genotype: The number of discordant called genotypes that demonstrated discordant allele calls not explained as allele loss or allele drop-in, e.g., 0/0 vs 1/1 or 0/1 vs 0/2.
GSA Discordance - Number Discordant genotype due to false Heterozygous Call: The number of discordant called genotypes that demonstrated a homozygous genotype call in the GSA 200 ng reference data but a heterozygous genotype call in the sample data.
GSA Discordance - Number Discordant genotype due to false Homozygous Call: The number of discordant called genotypes that demonstrated a heterozygous genotype call in the GSA 200 ng reference data but a homozygous genotype call in the sample data.
GSA Discordance - Number Discordant genotype, likely OppStrandReport: The number of designated discordant SNP genotypes likely due to opposite strand reporting in the GSA reference data.
Heterozygosity of GSA consistent SNP genotypes: Ratio of heterozygous genotypes observed in genome sequencing data at loci consistent with SNP loci on the GSA BeadChip to total number of observed genotypes at loci consistent with SNP loci on the GSA BeadChip.
GEDmatch PRO Upload - Total Useable SNPs: Total number of available SNPs in the genome sequencing artificial degradation sample data determined "usable" by GEDmatch PRO during database upload.
GEDmatch PRO Upload - Total SNPs after Slimming: Total number of SNPs assigned to a GEDmatch PRO kit ID after the database's proprietary "slimming" algorithm is applied.
GEDmatch PRO Upload - Percent SNPs used for matching: Percent of total SNPs remaining in kitID genotype after slimming.
GEDmatch PRO Upload - Total Matches: Total number of database matches determined for each uploaded genome sequencing artificial degradation sample profile.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
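The "OppStrandReport" categories above describe discordances that disappear once the GSA genotype is read on the opposite DNA strand. A minimal sketch of such a check follows, assuming two-letter nucleotide genotypes (for example AG) in both data sets. The function name and the rule itself are assumptions for illustration; the README does not state the exact criteria used to flag these SNPs.

```python
# Watson-Crick complement table for the four nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def likely_opposite_strand(gsa_gt, wgs_gt):
    """True if the genotypes differ as written but agree after complementing the GSA call.

    Caveat: for A/T and C/G SNPs the complement uses the same letters, so a
    homozygous AA vs TT pair is flagged here even though it could equally be a
    true genotype difference. Such sites need separate handling.
    """
    gsa, wgs = gsa_gt.upper(), wgs_gt.upper()
    if sorted(gsa) == sorted(wgs):
        return False  # already concordant, nothing to explain
    return sorted(gsa.translate(COMPLEMENT)) == sorted(wgs)


print(likely_opposite_strand("AG", "CT"))
print(likely_opposite_strand("AA", "GG"))
```

Sorting the alleles before comparison makes the check independent of allele order within a genotype.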
Metrics compiled for SENSITIVITY STUDY samples processed with Illumina GSAv2 BEADCHIPS are contained in 15PNIJ-21-GG-04143-MUMU_GSAv2_SampleMetrics_Sensitivity.xlsx and include sample quality metrics as follows:
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Sample Type - Comments: Designates the origin of the DNA sample.
Quantifiler Trio Degradation Index (DI): The determined DI ratio of the small autosomal target to large autosomal target resulting from quantitative PCR using ThermoFisher's Quantifiler Trio Assay.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
GSAv2 SNP Calls: The number of genotyped target SNP loci observed in the sample results.
Percent Called: Percent of total number of target SNPs on the GSAv2 BeadChip that produced a genotype call, maximum number of SNPs = 630033.
Heterozygosity: Percent of observed heterozygous genotypes out of the total number of resultant genotypes per sample.
No Calls: The number of GSAv2 loci interrogated generating no genotype call.
No Calls Percent: Percent of total GSAv2 loci interrogated generating no genotype call.
Number Concordant Sites: The number of concordant genotype results compared to a 200 ng input reference GSAv2 genotype for the corresponding donor, inclusive of NO CALL in both sample and reference.
Number Concordant NO CALL in both Sample and Reference: The number of SNP sites generating no genotype call in both the sensitivity sample and the corresponding donor reference genotype.
Number CALLED Concordant Genotypes: The number of concordant genotypes called compared to a 200 ng input reference GSAv2 genotype for the corresponding donor.
Percent Concordant CALLED genotypes: Percentage of called sites in the sample data that are concordant with the 200 ng reference genotype.
GSA Discordance - Number of Discordant Loci Including NO CALL in Reference: The total number of discordant loci in the sensitivity sample including sites with NO CALL in the reference data.
GSA Discordance Locus Recovery - Genotype discordance due to genotype failed in 200 ng REF: The number of discordant sample SNP genotypes due to NO CALL at corresponding sites in the reference sample.
GSA Discordance - Number Discordant genotypes CALLED in sample and reference: The total number of discordant genotypes at sites called in both the sensitivity sample and the 200 ng reference sample.
GSA Discordance - Percent Discordant Genotypes CALLED in sample including NO CALL in Reference: The percent of the total number of genotypes observed in sample demonstrating discordant genotype results to the 200 ng reference sample, inclusive of NO CALLS in reference.
GSA Discordance - Percent Discordant Genotypes CALLED in sample and reference: The percent of the total number of genotypes at SNP sites CALLED in both sample and 200 ng reference demonstrating discordant genotype results.
GSA Discordance - Discordant genotypes due to false genotype: The number of discordant called genotypes that demonstrated discordant homozygous allele calls, e.g., GSAv2 reference demonstrates genotypes as 0/0 while sample demonstrates genotype as 1/1.
GSA Discordance - Discordant genotypes due to false Heterozygous Call: The number of discordant called genotypes that demonstrated a homozygous genotype call in the GSA 200 ng reference data but a heterozygous genotype call in the GSAv2 sensitivity sample data.
GSA Discordance - Discordant genotypes due to false Homozygous Call: The number of discordant called genotypes that demonstrated a heterozygous genotype call in the GSA 200 ng reference data but a homozygous genotype call in the GSAv2 sensitivity sample data.
GEDmatch PRO Upload - Total Useable SNPs: Total number of available SNPs in the GSAv2 sensitivity sample data determined "usable" by GEDmatch PRO during database upload.
GEDmatch PRO Upload - Total SNPs after Slimming: Total number of SNPs assigned to a GEDmatch PRO kit ID after the database's proprietary "slimming" algorithm is applied.
GEDmatch PRO Upload - Percent SNPs used for matching: Percent of total SNPs remaining in kitID genotype after slimming.
GEDmatch PRO Upload - Total Matches: Total number of database matches determined for each uploaded GSAv2 sensitivity sample profile.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
“null” indicates missing data in cells of column X, Y, Z, and AA due to removal of the corresponding sample from GEDmatch before the metric was collected.
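The GSAv2 concordance and discordance counts defined above partition each SNP site by whether it was called in the sample, in the 200 ng reference, or in both. The sketch below shows one way to reproduce that partition, assuming genotypes are held in dictionaries keyed by SNP identifier and that "--" marks a no call. The data layout, the no-call symbol, and all names are assumptions, not the authors' workflow.

```python
NO_CALL = "--"  # assumed no-call symbol; adjust to match the genotype export


def norm(gt):
    """Make allele order irrelevant, so 'AG' and 'GA' compare equal."""
    return "".join(sorted(gt))


def summarize(sample, reference):
    """Partition reference sites into the concordance categories (hypothetical helper)."""
    counts = {
        "concordant_no_call": 0,      # NO CALL in both sample and reference
        "concordant_called": 0,       # called in both, same genotype
        "discordant_ref_no_call": 0,  # called in sample, NO CALL in reference
        "discordant_called": 0,       # called in both, different genotype
        "sample_no_call_only": 0,     # NO CALL in sample, called in reference
    }
    for snp, ref_gt in reference.items():
        smp_gt = sample.get(snp, NO_CALL)
        if smp_gt == NO_CALL and ref_gt == NO_CALL:
            counts["concordant_no_call"] += 1
        elif smp_gt == NO_CALL:
            counts["sample_no_call_only"] += 1
        elif ref_gt == NO_CALL:
            counts["discordant_ref_no_call"] += 1
        elif norm(smp_gt) == norm(ref_gt):
            counts["concordant_called"] += 1
        else:
            counts["discordant_called"] += 1
    sample_called = (counts["concordant_called"]
                     + counts["discordant_ref_no_call"]
                     + counts["discordant_called"])
    called_in_both = counts["concordant_called"] + counts["discordant_called"]
    # Percent of sites called in the sample that match the reference.
    counts["pct_concordant_called"] = (
        100.0 * counts["concordant_called"] / sample_called if sample_called else None)
    # Percent of sites called in both sample and reference that disagree.
    counts["pct_discordant_called_in_both"] = (
        100.0 * counts["discordant_called"] / called_in_both if called_in_both else None)
    return counts


reference = {"rs1": "AG", "rs2": "--", "rs3": "CC", "rs4": "--", "rs5": "TT"}
sample = {"rs1": "GA", "rs2": "--", "rs3": "CT", "rs4": "AA", "rs5": "--"}
print(summarize(sample, reference))
```

The two percentages use different denominators, which mirrors the distinction the column definitions draw between "CALLED in sample" and "CALLED in sample and reference".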
Metrics compiled for ARTIFICIAL DEGRADATION STUDY samples processed with Illumina GSAv2 BEADCHIPS are contained in 15PNIJ-21-GG-04143-MUMU_GSAv2_SampleMetrics_Degradation.xlsx and include sample quality metrics as follows:
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Study: Designates if the sample is an artificially degraded test sample or an undegraded, high quality reference DNA extract used for concordance comparisons.
Sample Type - Comments: Designates the origin of the DNA sample.
Degradation Method: Designates the method of artificial degradation applied to the sample.
Exposure Time Point: Categorical designation of exposure time: 1 = shortest exposure, 4 = longest exposure.
Exposure Time Length: Specifies the numerical length of exposure time.
Quantifiler Trio Degradation Index (DI): The determined DI ratio of the small autosomal target to large autosomal target resulting from quantitative PCR using ThermoFisher's Quantifiler Trio Assay.
Genomic DIN: Measure of genomic DNA fragmentation as measured by Agilent TapeStation gel electrophoresis, DIN = DNA Integrity Number.
STR Percent Profile: The percent of STR alleles obtained out of the total expected STR alleles for the sample donor.
Average Profile Peak Height: Average peak height for alleles across all amplified STR loci in relative fluorescence units (RFU).
Profile Balance: Comparison between the highest STR loci allele peak heights and the lowest STR loci allele peak heights. N/A indicates no balance obtained due to locus dropout across the profile.
FI Value: Calculated Forensic Index (FI) value for the sample.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
GSAv2 SNP Calls: The number of genotyped target SNP loci observed in the sample results.
Percent Called: Percent of total number of target SNPs on the GSAv2 BeadChip that produced a genotype call, maximum number of SNPs = 630033.
Heterozygosity: Percent of observed heterozygous genotypes out of the total number of resultant genotypes per sample.
No Calls: The number of SNP loci generating no genotype call.
No Calls Percent: Percent of total SNP loci interrogated generating no genotype call.
Number Concordant Sites: The number of concordant genotype results compared to a 200 ng input reference GSAv2 genotype for the corresponding donor, inclusive of NO CALL in both sample and reference.
Number Concordant NO CALL in both Sample and Reference: The number of SNP sites generating no genotype call in both the artificial degradation sample and the corresponding donor reference genotype.
Number CALLED Concordant Genotypes: The number of concordant genotypes called compared to a 200 ng input reference GSAv2 genotype for the corresponding donor.
Percent Concordant CALLED genotypes: Percentage of called sites in the sample data that are concordant with the 200 ng reference genotype.
GSA Discordance - Number of Discordant Loci Including NO CALL in Reference: The total number of discordant loci in the artificial degradation sample including sites with NO CALL in the reference data.
GSA Discordance Locus Recovery - Genotype discordance due to genotype failed in 200 ng REF: The number of discordant sample SNP genotypes due to NO CALL at corresponding sites in the reference sample.
GSA Discordance - Number Discordant genotypes CALLED in sample and reference: The total number of discordant genotypes at sites called in both the artificial degradation sample and the 200 ng reference sample.
GSA Discordance - Percent Discordant Genotypes CALLED in sample including NO CALL in Reference: The percent of the total number of genotypes observed in sample demonstrating discordant genotype results to the 200 ng reference sample, inclusive of NO CALLS in reference.
GSA Discordance - Percent Discordant Genotypes CALLED in sample and reference: The percent of the total number of genotypes at SNP sites CALLED in both sample and 200 ng reference demonstrating discordant genotype results.
GSA Discordance - Discordant genotypes due to false genotype: The number of discordant called genotypes that demonstrated discordant homozygous allele calls, e.g., GSAv2 reference demonstrates genotypes as 0/0 while sample demonstrates genotype as 1/1.
GSA Discordance - Discordant genotypes due to false Heterozygous Call: The number of discordant called genotypes that demonstrated a homozygous genotype call in the GSA 200 ng reference data but a heterozygous genotype call in the GSAv2 artificial degradation sample data.
GSA Discordance - Discordant genotypes due to false Homozygous Call: The number of discordant called genotypes that demonstrated a heterozygous genotype call in the GSA 200 ng reference data but a homozygous genotype call in the GSAv2 artificial degradation sample data.
GEDmatch PRO Upload - Total Useable SNPs: Total number of available SNPs in the GSAv2 artificial degradation sample data determined "usable" by GEDmatch PRO during database upload.
GEDmatch PRO Upload - Total SNPs after Slimming: Total number of SNPs assigned to a GEDmatch PRO kit ID after the database's proprietary "slimming" algorithm is applied.
GEDmatch PRO Upload - Percent SNPs used for matching: Percent of total SNPs remaining in kitID genotype after slimming.
GEDmatch PRO Upload - Total Matches: Total number of database matches determined for each uploaded GSAv2 artificial degradation sample profile.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
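Two of the simpler metrics in this file are plain ratios: the Degradation Index (small autosomal target over large autosomal target) and Percent Called (genotyped SNPs over the 630033 GSAv2 maximum). The sketch below restates them as arithmetic. The function names and the choice to return None when the large target is not detected are assumptions made for illustration.

```python
GSAV2_MAX_SNPS = 630033  # maximum stated in the "Percent Called" definition


def degradation_index(small_target_conc, large_target_conc):
    """Ratio of small-target to large-target qPCR concentration.

    Returns None when the large target was not detected, since the ratio is
    undefined in that case (an assumed convention, not taken from the data set).
    """
    if large_target_conc <= 0:
        return None
    return small_target_conc / large_target_conc


def percent_called(snp_calls, max_snps=GSAV2_MAX_SNPS):
    """Percent of target SNPs that produced a genotype call."""
    return 100.0 * snp_calls / max_snps


print(degradation_index(1.0, 0.25))
print(percent_called(600000))
```

Larger DI values suggest more fragmentation, because the longer qPCR target is lost before the shorter one as DNA degrades.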
GEDmatch genealogical match metrics are provided for each Phase I (SENSITIVITY and ARTIFICIAL DEGRADATION) and Phase II (MOCK CASEWORK) test sample and are separated by genotyping technology. Results for the GSAv2 and genome sequencing datasets are compiled from One-to-Many Segment Based comparisons in GEDmatch PRO. Results for Kintelligence datasets are compiled from One-to-Many Kinship DNA comparisons in GEDmatch PRO. Self-matching metrics are derived from comparisons to donor-specific GSAv2 200 ng reference sample kits. Matching metrics are provided only for the known relative comparisons.
GEDmatch PRO matching metrics for sample kits derived from KINTELLIGENCE genotyping results are compiled in the file 15PNIJ-21-GG-04143-MUMU_Kintelligence_GEDmatchPRO_MatchMetrics.xlsx.
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Study: Designates if the sample is an artificially degraded test sample or an undegraded, high quality reference DNA extract used for concordance comparisons.
Sample Type - Comments: Designates the origin of the DNA sample.
Degradation Method: Designates the method of artificial degradation applied to the sample.
Exposure Time Point: Categorical designation of exposure time: 1 = shortest exposure, 4 = longest exposure.
Exposure Time Length: Specifies the numerical length of exposure time.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
Expected Relationship: The known and documented family relationship between known relative and sample donor.
Expected Relationship Degree: The distance classification of the documented relationship between the known relative and sample donor. Designated by the One-to-Many Kinship Generation Chart in GEDmatchPRO.
Sample Kit Database Matching comments: Describes the GEDmatch PRO One-to-Many comparison outcome, whether the match was detected and if the match fell within the expected relationship centimorgan range.
Shared cM: The calculated total shared content in centimorgans (cM) between the known relative and experimental sample kit determined with One-to-Many Kinship matching.
Longest segment (cM): The calculated length of the longest shared segment between the known relative and experimental sample kit determined with One-to-Many Kinship matching.
Segment Count: Total number of shared segments between known relative and experimental sample kit determined with One-to-Many Kinship matching.
Mean segment cM: Total calculated shared DNA content in cM divided by the total number of DNA segments determined with One-to-Many Kinship matching.
SNP Overlap: The total number of shared SNPs evaluated by both the known relative's kit and the experimental sample's kit or the number of SNPs that are being used to compare two kits.
Whole Genome Kinship Coefficient: The probability that SNP alleles from two individuals at the same location on a genome are identical by descent. Calculated with One-to-One Kinship Matching in GEDmatch PRO.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
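Two of the match metrics above lend themselves to a quick sanity check: Mean segment cM is total shared cM divided by segment count, and a kinship coefficient can be compared with its textbook expectation of (1/2)^(d+1) for relatives of degree d in an outbred pedigree. The sketch below assumes the standard genetic degree convention (parent/child = 1), which may not correspond to the degrees shown in the GEDmatch PRO generation chart, and says nothing about how GEDmatch PRO computes its own coefficient.

```python
def mean_segment_cm(shared_cm, segment_count):
    """Total shared cM divided by the number of shared segments."""
    if segment_count == 0:
        return None  # no shared segments, so the mean is undefined
    return shared_cm / segment_count


def expected_kinship(degree):
    """Textbook expected kinship coefficient for degree-d relatives: (1/2) ** (d + 1).

    Assumes an outbred pedigree and the convention that parent/child is degree 1.
    """
    if degree < 1:
        raise ValueError("degree must be 1 or greater")
    return 0.5 ** (degree + 1)


print(mean_segment_cm(212.5, 5))
for d in range(1, 6):
    print(d, expected_kinship(d))
```

Observed coefficients scatter around these expectations because recombination makes actual sharing vary between relatives of the same degree.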
GEDmatch PRO matching metrics for sample kits derived from GSAv2 BEADCHIP genotyping results are compiled in the file 15PNIJ-21-GG-04143-MUMU_GSAv2_GEDmatchPRO_MatchMetrics.xlsx.
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Study: Designates if the sample is an artificially degraded test sample or an undegraded, high quality reference DNA extract used for concordance comparisons.
Sample Type - Comments: Designates the origin of the DNA sample.
Degradation Method: Designates the method of artificial degradation applied to the sample.
Exposure Time Point: Categorical designation of exposure time: 1 = shortest exposure, 4 = longest exposure.
Exposure Time Length: Specifies the numerical length of exposure time.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
Expected Relationship: The known and documented family relationship between known relative and sample donor.
Sample Kit Database Matching comments: Describes the GEDmatch PRO One-to-Many comparison outcome, whether the match was detected and if the match fell within the expected relationship centimorgan range.
One-to-Many Segment Based Shared cM: The calculated total shared content in centimorgans (cM) between the known relative and experimental sample kit determined with One-to-Many Segment-Based DNA matching.
Longest segment (cM): The calculated length of the longest shared segment between the known relative and experimental sample kit determined with One-to-Many Segment-Based DNA matching.
Generation estimate of match: Estimate of the number of generations to the most recent common ancestor shared by two kits matching in GEDmatch PRO.
SNP Overlap: The total number of shared SNPs evaluated by both the known relative's kit and the experimental sample's kit or the number of SNPs that are being used to compare two kits.
Cells containing “n/a” indicate non-applicable metrics for the corresponding sample.
“null” indicates missing data in cells of columns K, L, M, N, and O due to removal of the corresponding sample from GEDmatch before the metric was collected.
GEDmatch PRO matching metrics for sample kits derived from GENOME SEQUENCING genotyping results are compiled in the file 15PNIJ-21-GG-04143-MUMU_GenomeSequencing_GEDmatchPRO_MatchMetrics.xlsx.
Sample Name: Given sample identification name, combination of donor code, artificial degradation method, and DNA input for assay testing.
Donor Name: Donor code utilized throughout testing.
Technology: Designates the assay method used for sample generation.
Study: Designates if the sample is an artificially degraded test sample or an undegraded, high quality reference DNA extract used for concordance comparisons.
Sample Type - Comments: Designates the origin of the DNA sample.
Degradation Method: Designates the method of artificial degradation applied to the sample.
Exposure Time Point: Categorical designation of exposure time: 1 = shortest exposure, 4 = longest exposure.
Exposure Time Length: Specifies the numerical length of exposure time.
Total DNA input (ng): The total amount of DNA added to the library preparation assay for sample processing.
Expected Relationship: The known and documented family relationship between known relative and sample donor.
Sample Kit Database Matching comments: Describes the GEDmatch PRO One-to-Many comparison outcome, whether the match was detected and if the match fell within the expected relationship centimorgan range.
One-to-Many Segment Based Shared cM: The calculated total shared content in centimorgans (cM) between the known relative and experimental sample kit determined with One-to-Many Segment-Based DNA matching.
Longest segment (cM): The calculated length of the longest shared segment between the known relative and experimental sample kit determined with One-to-Many Kinship matching.
Generation estimate of match: Estimate of the number of generations to the most recent common ancestor shared by two kits matching in GEDmatch PRO.
SNP Overlap: The total number of shared SNPs evaluated by both the known relative's kit and the experimental sample's kit or the number of SNPs that are being used to compare two kits.
Cells containing n/a indicate non-applicable metrics for the corresponding sample.
“null” indicates missing data in one cell of column O due to removal of the corresponding sample from GEDmatch before the metric was collected.
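The two placeholder values described above carry different meanings, so it can help to keep them distinct when loading the spreadsheet into an analysis script. The following is a minimal sketch, assuming the cells have already been read into Python values by whatever spreadsheet reader is in use; the function name `parse_metric` and the status labels are hypothetical and not part of the dataset.

```python
from typing import Optional, Tuple


def parse_metric(cell: object) -> Tuple[Optional[float], str]:
    """Return (value, status) for one spreadsheet cell.

    status is "ok" for a numeric value, "not_applicable" for n/a,
    and "missing" for null or an empty cell (hypothetical labels).
    """
    if cell is None:
        return None, "missing"
    text = str(cell).strip()
    lowered = text.lower()
    if lowered == "n/a":
        # Metric does not apply to this sample.
        return None, "not_applicable"
    if lowered in ("null", ""):
        # Metric was never collected (sample removed from GEDmatch).
        return None, "missing"
    # Any other content is expected to be numeric; float() raises
    # ValueError otherwise, which surfaces unexpected cell contents.
    return float(text), "ok"
```

Keeping a status alongside the value lets downstream summaries exclude n/a cells without silently treating null cells as zero.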
Anonymized GEDmatch PRO match lists comparing matches sharing approximately 50 cM or more are provided for each donor for all Phase I SENSITIVITY and ARTIFICIAL DEGRADATION experiments in the file labeled 15PNIJ-21-GG-04143-MUMU_MD001_GEDmathPRO_MatchLists.docx.
Raw sequence data and final SNP genotypes are not provided to protect donor privacy.
Kit IDs of test samples or matching relatives are not provided to protect donor privacy.
Refer to the Methods provided with the dataset for more information.
Sharing/access Information
Links to other publicly accessible locations of the data: https://nij.ojp.gov/funding/awards/15pnij-21-gg-04143-mumu
A link to the Dryad dataset will also be accessible through the National Archive of Criminal Justice Data (NACJD): https://www.icpsr.umich.edu/web/pages/NACJD/.
Was data derived from another source? No
DNA extracts were genotyped by Illumina Global Screening Array BeadChip, whole genome sequencing (WGS) on NovaSeq 6000, and targeted sequencing with the Qiagen/Verogen ForenSeq Kintelligence Kit to evaluate the technology-specific sensitivity to low-level DNA input concentrations and specificity for artificially degraded DNA using whole semen and nascent semen DNA samples. To generate genotype call metrics using Qiagen ForenSeq Kintelligence, sequence libraries were prepared following the manufacturer’s recommended protocol for library preparation, library prep quality control, and MiSeq FGx sequencing. Raw sequence processing and genotype calling were performed in Qiagen/Verogen Universal Analysis Software with a 1.5% (10X coverage) analytical threshold.
To generate genotype call metrics using genome sequencing, dsDNA libraries were prepared using an internally optimized, proprietary protocol, pooled, and sequenced on the Illumina NovaSeq 6000 in 2x150 reads to a target depth of 30X coverage. Read quality filtering, mapping, alignment to the hg38_hs38DH reference genome, and allele calling were performed with DRAGEN™ (Illumina, San Diego, CA) v07.021.609.3.9.3, kernel release 3.10.0-1160.42.2.el7.x86_64. Customized scripts were used to parse whole genome allele calls to a target set of 2,061,275 SNPs and format genotypes for GEDmatch upload.
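The customized parsing scripts are not distributed with the dataset. As an illustration only, the step of reducing genome-wide calls to a target SNP set could look like the sketch below. The record layout, the function name `filter_to_targets`, and the four-column tab-delimited output are assumptions made for this example; they are not the study's scripts and not a statement of the GEDmatch upload specification.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical record layout: (snp_id, chromosome, position, genotype)
Call = Tuple[str, str, int, str]


def filter_to_targets(calls: Iterable[Call], target_ids: Set[str]) -> Iterator[str]:
    """Yield tab-delimited lines for calls whose SNP ID is in the target set.

    The header and column order are an assumed generic layout,
    not a documented upload format.
    """
    yield "rsid\tchromosome\tposition\tgenotype"
    for snp_id, chrom, pos, genotype in calls:
        if snp_id in target_ids:
            yield f"{snp_id}\t{chrom}\t{pos}\t{genotype}"
```

Using a set for the target IDs keeps each membership check constant-time, which matters when a genome-wide call set is screened against a target list of about two million SNPs.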
To generate genotype call metrics using Illumina Global Screening Array v2, an internally optimized workflow was followed at Gene by Gene (Houston, TX) to hybridize DNA extracts to a custom-built GSAv2 BeadChip. Scanning was performed on the Illumina iScan. Genotype calling was performed in GenomeStudio® v2.0 Genotyping Module.
Customized Excel workbooks and a SQL database were created to compare genotypes across technologies, determine concordant genotypes, calculate call rates, and determine heterozygosity of autosomal loci. Technology-specific processing quality metrics were compiled for all samples, and SNP genotyping results were evaluated for call rate and concordance to the known reference.
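The exact formulas used in the workbooks and database are not stated here. The sketch below shows one common way to define the three metrics, assuming genotypes are held as two-letter strings keyed by SNP ID with `None` for a no-call, and that only autosomal loci are passed in. All function names are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Hypothetical layout: SNP ID -> two-allele genotype such as "AG", or None for a no-call
Genotypes = Dict[str, Optional[str]]


def call_rate(sample: Genotypes, target_ids: Iterable[str]) -> Optional[float]:
    """Fraction of targeted SNPs with a called genotype."""
    targets = list(target_ids)
    if not targets:
        return None
    called = sum(1 for snp in targets if sample.get(snp))
    return called / len(targets)


def concordance(sample: Genotypes, reference: Genotypes) -> Optional[float]:
    """Fraction of SNPs called in both profiles whose genotypes agree.

    Allele order is ignored, so "AG" and "GA" count as concordant.
    """
    shared = [snp for snp, gt in sample.items() if gt and reference.get(snp)]
    if not shared:
        return None
    matches = sum(1 for snp in shared if sorted(sample[snp]) == sorted(reference[snp]))
    return matches / len(shared)


def heterozygosity(sample: Genotypes) -> Optional[float]:
    """Fraction of called genotypes that carry two different alleles."""
    called = [gt for gt in sample.values() if gt]
    if not called:
        return None
    het = sum(1 for gt in called if len(set(gt)) > 1)
    return het / len(called)
```

In this sketch, concordance is computed only over SNPs called in both the sample and the reference, so a low call rate does not by itself lower the concordance value; the two metrics are meant to be read together.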
GEDmatch PRO sample metrics were generated by uploading formatted genotype files through the GEDmatch PRO portal. GSAv2 genotype files and genome sequencing genotype files were matched to the database using the One-to-Many Segment Based algorithm. Kintelligence genotype files were matched to the database using the One-to-Many Kinship algorithm.
Usage notes:
Data quality metrics and SNP genotyping metrics are compiled in .xlsx or .csv tables for each of the three technologies. Sensitivity and degraded sample results are provided in separate files per technology. Neither SNP genotypes nor sequence data are provided, in order to maintain donor privacy.