Data and source code for: ClinVar and HGMD genomic variant classification accuracy has improved over time, as measured by implied disease burden

Citation

Sharo, Andrew; Zou, Yangyun; Adhikari, Aashish; Brenner, Steven (2022), Data and source code for: ClinVar and HGMD genomic variant classification accuracy has improved over time, as measured by implied disease burden, Dryad, Dataset, https://doi.org/10.6078/D1872X

Abstract

Curated databases of genetic variants assist clinicians and researchers in interpreting genetic testing results. Yet these databases contain variants annotated as pathogenic that do not result in pathogenic phenotypes. Using archives of ClinVar and HGMD, we investigated how variant misclassification has changed over six years across different ancestry groups. We considered inborn errors of metabolism (IEMs) screened in newborns as a model system, because these disorders are often highly penetrant with neonatal phenotypes. We used samples from the 1000 Genomes Project (1KGP) to identify individuals with genotypes that were annotated by the databases as pathogenic. Due to the rarity of IEMs, nearly all such annotated pathogenic genotypes indicate likely variant misclassification in ClinVar or HGMD. While the false positive rates of both ClinVar and HGMD have improved over time, HGMD variants currently would imply two orders of magnitude more affected individuals in 1KGP than ClinVar variants. We observed that African ancestry individuals have a significantly increased chance of being incorrectly indicated to be affected by a screened IEM when HGMD variants are used. However, this bias affecting genomes of African ancestry was no longer significant once common variants were removed in accordance with recent variant interpretation guidelines. We discovered that ClinVar variants classified as Pathogenic or Likely Pathogenic are reclassified 11-fold more often than DM or DM? variants in HGMD, which has likely resulted in ClinVar’s lower false positive rate. Considering misclassified variants that have since been reclassified, we found that variant interpretation guidelines and allele frequency databases composed of genetically diverse samples are important factors in reclassification. Finally, we find that ClinVar variants common in European and South Asian individuals were more likely to be reclassified to a lower confidence category, perhaps due to an increased chance of these variants being annotated by multiple submitters.

Usage Notes

This analysis was performed with Jupyter notebooks, so all code is in .ipynb files. We recommend running these files with Jupyter, which can be installed using conda. The notebooks should function in a Python 3.8 environment. Note that the visualizations in the three Floweaver*.ipynb files work only in the Jupyter Notebook environment, not in JupyterLab. If you have any questions about running these files, please contact asharo@ucsc.edu and brenner@compbio.berkeley.edu.

The following Python packages are required to run these notebooks:

  • pandas
  • cyvcf2
  • numpy
  • matplotlib
  • pickle (part of the Python standard library; no separate installation needed)
  • joblib
  • floweaver
  • ipysankeywidget
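
Before opening the notebooks, it can help to confirm that the packages above are importable in the active environment. The following is a minimal sketch, not part of the deposited code; the function name `missing_modules` is hypothetical, and the module names are assumed to match the package names listed above.

```python
from importlib.util import find_spec

# Import names for the packages listed above (assumed to match the
# package names). pickle ships with Python and is included only for
# completeness.
REQUIRED_MODULES = [
    "pandas", "cyvcf2", "numpy", "matplotlib",
    "pickle", "joblib", "floweaver", "ipysankeywidget",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any names reported as missing can then be installed with conda or pip before running the notebooks.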

To reproduce the analysis in full and to follow its logical flow, run the notebooks in the order listed below. If you are interested in a specific analysis, all intermediate files are also provided, so in practice you may run notebooks out of order. Due to restrictions on HGMD data sharing, primary and intermediate HGMD files are not provided. Please contact the authors if you would like to run this analysis with your own archived HGMD files. Most paths in the notebooks are hardcoded. The file names match those in the Dryad repository, but you will need to replace the paths with the location of your copy of the Dryad data files. Each notebook below is followed by a brief description of its function.

  • TGP and GnomAD ancestry proportions bar chart.ipynb – plot 1KGP and gnomAD ancestry
  • primaryAnalysis-ClinVar-allYears.ipynb – analyze ClinVar implied disease over time in 1KGP
  • NormalizedHGMDandClinVarIndividualsOverTime.ipynb – generate figures
  • primaryAnalysis-on Yangyun HGMD-updated.ipynb – preprocessing of 2014 HGMD data
  • primaryAnalysis-HGMD-2016-updated.ipynb – preprocessing of 2016 HGMD data
  • primaryAnalysis-HGMD-updated.ipynb – preprocessing of 2020 HGMD data
  • primaryAnalysis-HGMD-allYears.ipynb – analyze HGMD implied disease over time in 1KGP
  • NormalizedHGMDandClinVarIndividualsOverTime-Ghosh.ipynb – generate figure
  • primaryAnalysis-HGMD-gnomADhoms.ipynb – repeat analysis with gnomAD data
  • primaryAnalysis-ClinVar-gnomADhoms.ipynb – repeat analysis with gnomAD data
  • NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Richards.ipynb – generate figure
  • NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Ghosh.ipynb – generate figure
  • UnifiedFigureOfIncidence.ipynb – generate figures
  • primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb – generate figures
  • Floweaver sankey HGMD.ipynb – generate figures
  • Floweaver sankey ClinVar-v4.ipynb – generate figures
  • Floweaver sankey ClinVar-v6.ipynb – generate figures
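
Because most paths in the notebooks are hardcoded, one way to adapt them is to define the data location once and build every file path from it. The sketch below is an illustration under assumptions, not the authors' code: `DATA_DIR` points to a hypothetical download location, and `data_path` is a hypothetical helper.

```python
from pathlib import Path

# Hypothetical location of the unpacked Dryad files; change to your own path.
DATA_DIR = Path("~/dryad_data").expanduser()


def data_path(filename, data_dir=DATA_DIR):
    """Return the full path to a Dryad data file, failing early if it is absent."""
    path = Path(data_dir) / filename
    if not path.exists():
        raise FileNotFoundError(f"{filename} not found in {data_dir}")
    return path
```

A hardcoded path in a notebook could then be replaced with a call such as `data_path("superpopulation_key.csv")`, so that a missing or misplaced file is reported immediately instead of partway through an analysis.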

Figures from the Sharo et al. manuscript mapped to their corresponding notebooks:

  • 1a: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
  • 1b: TGP and GnomAD ancestry proportions bar chart.ipynb
  • 1c: primaryAnalysis-ClinVar-allYears.ipynb
  • 1d: primaryAnalysis-ClinVar-allYears.ipynb
  • 1e: primaryAnalysis-HGMD-allYears.ipynb
  • 1f: primaryAnalysis-HGMD-allYears.ipynb
  • 1g: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
  • 2a: Floweaver sankey ClinVar-v4.ipynb
  • 2b: Floweaver sankey HGMD.ipynb
  • 2c: Floweaver sankey ClinVar-v6.ipynb
  • 2d: primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb
  • 2e: Floweaver sankey ClinVar-v6.ipynb
  • 2f: primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb
  • S1a,b: primaryAnalysis-ClinVar-allYears.ipynb
  • S2a,b: primaryAnalysis-ClinVar-allYears.ipynb
  • S2c,d: primaryAnalysis-HGMD-allYears.ipynb
  • S2e: NormalizedHGMDandClinVarIndividualsOverTime-Ghosh.ipynb
  • S3a,b,c,d: UnifiedFigureOfIncidence.ipynb
  • S4a,b,c,d: UnifiedFigureOfIncidence.ipynb
  • S5a,b,c,d: UnifiedFigureOfIncidence.ipynb
  • S6a: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
  • S6b: TGP and GnomAD ancestry proportions bar chart.ipynb
  • S6c: primaryAnalysis-ClinVar-gnomADhoms.ipynb
  • S6d: primaryAnalysis-ClinVar-gnomADhoms.ipynb
  • S6e: primaryAnalysis-HGMD-gnomADhoms.ipynb
  • S6f: primaryAnalysis-HGMD-gnomADhoms.ipynb
  • S6g: NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Richards.ipynb
  • S7a: NormalizedHGMDandClinVarIndividualsOverTime.ipynb
  • S7b: TGP and GnomAD ancestry proportions bar chart.ipynb
  • S7c: primaryAnalysis-ClinVar-gnomADhoms.ipynb
  • S7d: primaryAnalysis-ClinVar-gnomADhoms.ipynb
  • S7e: primaryAnalysis-HGMD-gnomADhoms.ipynb
  • S7f: primaryAnalysis-HGMD-gnomADhoms.ipynb
  • S7g: NormalizedHGMDandClinVarIndividualsOverTime-gnomAD-Ghosh.ipynb
  • S8a,b,c,d: UnifiedFigureOfIncidence.ipynb
  • S9a,b,c,d: UnifiedFigureOfIncidence.ipynb
  • S10a,b,c,d: UnifiedFigureOfIncidence.ipynb
  • S11: primaryAnalysis-ClinVar-varReclassBiasGnomAD-v2-binary heterozygosity-ThresholdAF-equally-reclass per var month-gnomAD exomes-updatedCategoriesRateOverAllTime.ipynb
  • S12: Floweaver sankey ClinVar-v4.ipynb

Directory of data files:

  • 20130606_sample_info.csv – 1KGP sample ancestries
  • 80.genes.160721.txt – list of 80 metabolic genes
  • ALL.chr1-Y_GRCh38.genotypes.20170504.split.vep.80mets.sorted.noCSQ.header.clinVarYearsVCF.exac.vcf.gz – 1KGP variants annotated with ClinVar and allele frequency
  • ClinVarDenominators20210625.csv – intermediate file
  • ClinVar.Multiyear.gnomAD.esp.homs.vep.vcf.gz – ClinVar variants annotated with gnomAD allele frequency
  • ClinVar.Multiyear.tgp.gnomAD.20210527.vep.vcf.gz – ClinVar variants annotated with classification through time and allele frequency
  • ClinVarReclassCV.csv – intermediate file
  • ClinVarReclassDfChaMerged.csv – intermediate file
  • ClinVarReclassForSankeyChangeKey.csv – counts of variant reclassifications
  • clinVarVCF/ – folder of original and normalized ClinVar variant classifications
  • clnrevLst.pickle – list of all dates with ClinVar review status
  • clnsigLst.pickle – list of all dates with ClinVar clinical significance
  • dfChaMergedv6.csv – intermediate file
  • dloads1LstFlat.pickle – intermediate file
  • dloads2LstFlat.csv – intermediate file
  • hgmdDenom2014.csv – number of relevant HGMD variants in 2014
  • hgmdDenom2016.csv – number of relevant HGMD variants in 2016
  • hgmdDenom2020.csv – number of relevant HGMD variants in 2020
  • HGMD.fullghosh.pickle – intermediate file
  • HGMD.selectghosh.pickle – intermediate file
  • incidenceHGMD*.joblib – intermediate file
  • knownSmallCats.csv – counts of variant reclassifications
  • metsLst80.pickle – list of 80 metabolic genes
  • reclassDctToRage.csv – reclassification counts by ancestry
  • smallRelativeHet.csv – intermediate file
  • superpopulation_key.csv – 1KGP superpopulations
  • toplot*.pickle – intermediate file
  • totalSmallCats.csv – variant reclassification counts
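
The intermediate files above come in three formats (.csv, .pickle, and .joblib). As a rough sketch of how they might be opened outside the notebooks, the hypothetical helper below picks a loader by file extension. It assumes the CSV files have a header row and that the .pickle and .joblib files were written with `pickle.dump` and `joblib.dump`. The notebooks themselves may load these files differently (for example with pandas).

```python
import csv
import pickle
from pathlib import Path


def load_data_file(path):
    """Load a .pickle, .csv, or .joblib file based on its extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pickle":
        # Only unpickle files obtained from a source you trust.
        with path.open("rb") as handle:
            return pickle.load(handle)
    if suffix == ".csv":
        # Assumes a header row; each row becomes a dict keyed by column name.
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    if suffix == ".joblib":
        import joblib  # imported lazily so the other loaders work without it
        return joblib.load(path)
    raise ValueError(f"No loader defined for {path.name}")
```

For example, `load_data_file("metsLst80.pickle")` would return whatever object was stored in that file. The compressed VCF files are not covered here, since they are better read with cyvcf2.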

Funding

National Science Foundation, Award: DGE 1752814

National Institutes of Health, Award: P01 AI138962

National Science Foundation, Award: 2109912

National Institutes of Health, Award: U19 HD077627

National Institutes of Health, Award: U41 HG007346