Data and source code for: ClinVar and HGMD genomic variant classification accuracy has improved over time, as measured by implied disease burden

Sharo, Andrew; Zou, Yangyun; Adhikari, Aashish; Brenner, Steven

Published Oct 26, 2022 on Dryad. https://doi.org/10.6078/D1872X

Data files

Oct 26, 2022 version files (3.77 GB total):

dryadData.tar.gz (3.77 GB)
README.md (8.44 KB)

Abstract

Curated databases of genetic variants assist clinicians and researchers in interpreting genetic testing results. Yet these databases contain variants annotated as pathogenic that do not result in pathogenic phenotypes. Using archives of ClinVar and HGMD, we investigated how variant misclassification has changed over six years across different ancestry groups. We considered inborn errors of metabolism (IEMs) screened in newborns as a model system, because these disorders are often highly penetrant with neonatal phenotypes. We used samples from the 1000 Genomes Project (1KGP) to identify individuals with genotypes that were annotated by the databases as pathogenic. Due to the rarity of IEMs, nearly all such annotated pathogenic genotypes indicate likely variant misclassification in ClinVar or HGMD. While the false positive rates of both ClinVar and HGMD have improved over time, HGMD variants currently would imply two orders of magnitude more affected individuals in 1KGP than ClinVar variants. We observed that African ancestry individuals have a significantly increased chance of being incorrectly indicated to be affected by a screened IEM when HGMD variants are used. However, this bias affecting genomes of African ancestry was no longer significant once common variants were removed in accordance with recent variant interpretation guidelines. We discovered that ClinVar variants classified as Pathogenic or Likely Pathogenic are reclassified 11-fold more often than DM or DM? variants in HGMD, which has likely resulted in ClinVar’s lower false positive rate. Considering misclassified variants that have since been reclassified, we found that variant interpretation guidelines and allele frequency databases comprised of genetically diverse samples are important factors in reclassification. 
Finally, we find that ClinVar variants common in European and South Asian individuals were more likely to be reclassified to a lower confidence category, perhaps due to an increased chance of these variants being annotated by multiple submitters.

This analysis was performed with Jupyter notebooks, so all code is in .ipynb files. We recommend running these files with Jupyter, which can be installed using conda. The notebooks should function in a Python 3.8 environment. Note that the visualizations in the three Floweaver*.ipynb files will work only in a Jupyter Notebook environment and not in a JupyterLab environment. If you have any questions about running these files, please contact asharo@ucsc.edu and brenner@compbio.berkeley.edu.

The following Python packages are required to run these notebooks (pickle is part of the Python standard library and needs no separate installation):

pandas
cyvcf2
numpy
matplotlib
pickle
joblib
floweaver
ipysankeywidget
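
Before opening the notebooks, it can help to confirm that every package in the list above is importable in the active environment. The following is a minimal sketch, not part of the original analysis; it assumes each package's import name matches the name listed above, and the helper name missing_packages is hypothetical.

```python
import importlib.util

# Import names copied from the package list above (assumed to match install names).
REQUIRED = [
    "pandas", "cyvcf2", "numpy", "matplotlib",
    "pickle", "joblib", "floweaver", "ipysankeywidget",
]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Any package reported as missing can then be installed with conda or pip before running the notebooks.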

To reproduce the analysis in full, and to understand its logical flow, run the notebooks in the order listed below. If you are interested in only a specific analysis, all intermediate files are provided, so in practice you may run notebooks out of order. Due to restrictions on HGMD data sharing, primary and intermediate HGMD files are not provided; please contact the authors if you would like to run this analysis with your own archived HGMD files. Most paths in the notebooks are hardcoded. The file names are identical to those in the Dryad repository, but you will need to replace the paths with the location of your copy of the Dryad data files. Each notebook below is followed by a brief description of its function.
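
One way to handle the hardcoded paths is to define the data location once and build every file path from it. This is a hedged sketch, not how the notebooks are written: the environment variable name DRYAD_DATA_DIR, the default folder name "dryadData", and the helper data_path are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Hypothetical: DRYAD_DATA_DIR and the "dryadData" default are placeholders;
# point this at wherever you extracted dryadData.tar.gz.
DATA_DIR = Path(os.environ.get("DRYAD_DATA_DIR", "dryadData"))


def data_path(file_name, data_dir=DATA_DIR):
    """Build the full path to a data file, failing early if it is absent."""
    path = Path(data_dir) / file_name
    if not path.exists():
        raise FileNotFoundError(f"{file_name} not found under {data_dir}")
    return path
```

In a notebook, a hardcoded path could then be replaced with a call such as data_path("superpopulation_key.csv"), so that only DATA_DIR needs to change between machines.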

TGP and GnomAD ancestry proportions bar chart.ipynb – plot 1KGP and gnomAD ancestry
primaryAnalysis-ClinVar-allYears.ipynb – analyze ClinVar implied disease over time in 1KGP
NormalizedHGMDandClinVarIndividualsOverTime.ipynb – generate figures
primaryAnalysis-on Yangyun HGMD-updated.ipynb – preprocessing of 2014 HGMD data
primaryAnalysis-HGMD-2016-updated.ipynb – preprocessing of 2016 HGMD data
primaryAnalysis-HGMD-updated.ipynb – preprocessing of 2020 HGMD data
primaryAnalysis-HGMD-allYears.ipynb – analyze HGMD implied disease over time in 1KGP
NormalizedHGMDandClinVarIndividualsOverTime-Ghosh.ipynb – generate figure
primaryAnalysis-HGMD-gnomADhoms.ipynb – repeat analysis with gnomAD data
primaryAnalysis-ClinVar-gnomADhoms.ipynb – repeat analysis with gnomAD data
NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Richards.ipynb – generate figure
NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Ghosh.ipynb – generate figure
UnifiedFigureOfIncidence.ipynb – generate figures
primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb – generate figures
Floweaver sankey HGMD.ipynb – generate figures
Floweaver sankey ClinVar-v4.ipynb – generate figures
Floweaver sankey ClinVar-v6.ipynb – generate figures
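
The ordered list above can also be run non-interactively. The sketch below is an illustration under stated assumptions rather than the authors' workflow: it assumes the jupyter nbconvert command is installed, that the notebooks sit in the current directory with their paths already updated, and it lists only the first two notebooks. The function names are hypothetical.

```python
import shlex
import subprocess

# First two notebooks from the ordered list above; extend with the rest as needed.
NOTEBOOKS = [
    "TGP and GnomAD ancestry proportions bar chart.ipynb",
    "primaryAnalysis-ClinVar-allYears.ipynb",
]


def nbconvert_command(notebook):
    """Build the argument list that executes one notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook]


def run_in_order(notebooks, dry_run=True):
    """Run notebooks sequentially; with dry_run=True, only print the commands."""
    for notebook in notebooks:
        command = nbconvert_command(notebook)
        if dry_run:
            print(shlex.join(command))
        else:
            # check=True stops the sequence at the first notebook that fails.
            subprocess.run(command, check=True)


run_in_order(NOTEBOOKS)  # dry run: prints the commands without executing them
```

Calling run_in_order(NOTEBOOKS, dry_run=False) would execute the notebooks. The HGMD notebooks will fail without your own HGMD files, and the Floweaver visualizations are interactive widgets, so those notebooks are probably better opened directly in Jupyter Notebook.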

Figures from Sharo et al. manuscript mapped to their corresponding notebook:

1a: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
1b: TGP and GnomAD ancestry proportions bar chart.ipynb
1c: primaryAnalysis-ClinVar-allYears.ipynb
1d: primaryAnalysis-ClinVar-allYears.ipynb
1e: primaryAnalysis-HGMD-allYears.ipynb
1f: primaryAnalysis-HGMD-allYears.ipynb
1g: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
2a: Floweaver sankey ClinVar-v4.ipynb
2b: Floweaver sankey HGMD.ipynb
2c: Floweaver sankey ClinVar-v6.ipynb
2d: primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb
2e: Floweaver sankey ClinVar-v6.ipynb
2f: primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb
S1a,b: primaryAnalysis-ClinVar-allYears.ipynb
S2a,b: primaryAnalysis-ClinVar-allYears.ipynb
S2c,d: primaryAnalysis-HGMD-allYears.ipynb
S2e: NormalizedHGMDandClinVarIndividualsOverTime-Ghosh.ipynb
S3a,b,c,d: UnifiedFigureOfIncidence.ipynb
S4a,b,c,d: UnifiedFigureOfIncidence.ipynb
S5a,b,c,d: UnifiedFigureOfIncidence.ipynb
S6a: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
S6b: TGP and GnomAD ancestry proportions bar chart.ipynb
S6c: primaryAnalysis-ClinVar-gnomADhoms.ipynb
S6d: primaryAnalysis-ClinVar-gnomADhoms.ipynb
S6e: primaryAnalysis-HGMD-gnomADhoms.ipynb
S6f: primaryAnalysis-HGMD-gnomADhoms.ipynb
S6g: NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Richards.ipynb
S7a: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
S7b: TGP and GnomAD ancestry proportions bar chart.ipynb
S7c: primaryAnalysis-ClinVar-gnomADhoms.ipynb
S7d: primaryAnalysis-ClinVar-gnomADhoms.ipynb
S7e: primaryAnalysis-HGMD-gnomADhoms.ipynb
S7f: primaryAnalysis-HGMD-gnomADhoms.ipynb
S7g: NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Ghosh.ipynb
S8a,b,c,d: UnifiedFigureOfIncidence.ipynb
S9a,b,c,d: UnifiedFigureOfIncidence.ipynb
S10a,b,c,d: UnifiedFigureOfIncidence.ipynb
S11: primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb
S12: Floweaver sankey ClinVar-v4.ipynb

Directory of data files:

20130606_sample_info.csv – 1KGP sample ancestries
80.genes.160721.txt – list of 80 metabolic genes
ALL.chr1-Y_GRCh38.genotypes.20170504.split.vep.80mets.sorted.noCSQ.header.clinVarYearsVCF.exac.vcf.gz – 1KGP variants annotated with ClinVar and allele frequency
ClinVarDenominators20210625.csv – intermediate file
ClinVar.Multiyear.gnomAD.esp.homs.vep.vcf.gz – ClinVar variants annotated with gnomAD allele frequency
ClinVar.Multiyear.tgp.gnomAD.20210527.vep.vcf.gz – ClinVar variants annotated with classification through time and allele frequency
ClinVarReclassCV.csv – intermediate file
ClinVarReclassDfChaMerged.csv – intermediate file
ClinVarReclassForSankeyChangeKey.csv – counts of variant reclassifications
clinVarVCF/ – folder of original and normalized ClinVar variant classifications
clnrevLst.pickle – list of all dates with ClinVar review status
clnsigLst.pickle – list of all dates with ClinVar clinical significance
dfChaMergedv6.csv – intermediate file
dloads1LstFlat.pickle – intermediate file
dloads2LstFlat.csv – intermediate file
hgmdDenom2014.csv – number of relevant HGMD variants in 2014
hgmdDenom2016.csv – number of relevant HGMD variants in 2016
hgmdDenom2020.csv – number of relevant HGMD variants in 2020
HGMD.fullghosh.pickle – intermediate file
HGMD.selectghosh.pickle – intermediate file
incidenceHGMD*.joblib – intermediate file
knownSmallCats.csv – counts of variant reclassifications
metsLst80.pickle – list of 80 metabolic genes
reclassDctToRage.csv – reclassification counts by ancestry
smallRelativeHet.csv – intermediate file
superpopulation_key.csv – 1KGP superpopulations
toplot*.pickle – intermediate file
totalSmallCats.csv – variant reclassification counts
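
The intermediate files above come in several formats. As a hedged sketch, a single helper can pick a loader from the file extension; load_data_file is a hypothetical name, and the sketch assumes the CSV files parse with the default pandas.read_csv settings, which has not been verified for every file. The .vcf.gz files are not covered here and would be opened with cyvcf2 instead.

```python
import pickle
from pathlib import Path


def load_data_file(path):
    """Load a .csv, .pickle, or .joblib file based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        import pandas as pd  # imported lazily so other branches work without pandas
        return pd.read_csv(path)
    if suffix == ".pickle":
        # Only unpickle files from a source you trust, such as this repository.
        with path.open("rb") as handle:
            return pickle.load(handle)
    if suffix == ".joblib":
        import joblib  # imported lazily for the same reason as pandas
        return joblib.load(path)
    raise ValueError(f"No loader defined for {path.name}")
```

For example, load_data_file("metsLst80.pickle") would be expected to return the list of 80 metabolic genes, given the description above.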
