Source code for StrVCTVRE: a supervised learning method to predict the pathogenicity of human genome structural variants

Sharo, Andrew; Hu, Zhiqiang; Sunyaev, Shamil; Brenner, Steven

Published Aug 24, 2021; Updated Oct 26, 2021 on Dryad. https://doi.org/10.6078/D1GM63

Data files

Aug 24, 2021 version: StrVCTVRE_source_data.tar.gz (16.34 GB)

Oct 26, 2021 version: StrVCTVRE_source_data.tar.gz (16.34 GB)

Abstract

Whole genome sequencing resolves many clinical cases where standard diagnostic methods have failed. However, at least half of these cases remain unresolved after whole genome sequencing. Structural variants (SVs; genomic variants larger than 50 base pairs) of uncertain significance are the genetic cause of a portion of these unresolved cases. As sequencing methods using long or linked reads become more accessible and SV detection algorithms improve, clinicians and researchers are gaining access to thousands of reliable SVs of unknown disease relevance. Methods to predict the pathogenicity of these SVs are required to realize the full diagnostic potential of long-read sequencing. To address this emerging need, we developed StrVCTVRE to distinguish pathogenic SVs from benign SVs that overlap exons. In a random forest classifier, we integrated features that capture gene importance, coding region, conservation, expression, and exon structure. We found that features such as expression and conservation are important but are absent from SV classification guidelines. We leveraged multiple resources to construct a size-matched training set of rare, putatively benign and pathogenic SVs. StrVCTVRE performs accurately across a wide SV size range on independent test sets, which will allow clinicians and researchers to eliminate about half of SVs from consideration while retaining a 90% sensitivity. We anticipate clinicians and researchers will use StrVCTVRE to prioritize SVs in patients where no SV is immediately compelling, empowering deeper investigation into novel SVs to resolve cases and understand new mechanisms of disease. StrVCTVRE runs rapidly and is available at https://compbio.berkeley.edu/proj/strvctvre/.

Usage notes

Thank you for downloading the StrVCTVRE source code! StrVCTVRE was developed in Jupyter notebooks, so most of the code is in .ipynb files. I recommend running these files with Jupyter, which can be installed easily with conda. The notebooks were developed in a Python 2 environment and will likely not work in a Python 3 environment.

If you are interested in using the pre-trained StrVCTVRE algorithm to score SVs, please see https://compbio.berkeley.edu/proj/strvctvre/ for instructions on running StrVCTVRE on your SVs.

The following are required to run these notebooks.

Python packages:
- numpy
- pandas
- joblib
- scikit-learn
- pybedtools
- cyvcf2
- pybigwig
- matplotlib
- scipy
- pprint (part of the Python standard library)
- pickle (part of the Python standard library)

Linux environment with:
- bedtools
- SVScore (optional; see below)
- VEP (optional; see below)
- AnnotSV (optional; see below)
- X-CNV (optional; see below)
- CADD-SV (optional; see below)
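
As a rough aid, the requirements above can be checked before opening any notebook. The sketch below is not part of the StrVCTVRE source; the function names are hypothetical, and it only tests that each module imports and that bedtools is on PATH. It makes no attempt to check versions or the optional annotation tools.

```python
from __future__ import print_function

import importlib
import os

# Import names differ from package names for two entries:
# scikit-learn is imported as "sklearn" and pybigwig as "pyBigWig".
REQUIRED_MODULES = [
    "numpy", "pandas", "joblib", "sklearn", "pybedtools",
    "cyvcf2", "pyBigWig", "matplotlib", "scipy", "pprint", "pickle",
]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def find_on_path(program):
    """Return the full path of `program` if it is on PATH, else None."""
    for folder in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(folder, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def report():
    """Print which required modules and tools are missing."""
    print("Missing Python modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("bedtools:", find_on_path("bedtools") or "not found on PATH")
```

Calling `report()` inside the environment that will run the notebooks gives a quick overview of what still needs to be installed.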

Because the SV annotation methods are somewhat complex to install and run, they are optional: I have provided their post-annotation output files, so these methods do not need to be run.

If you prefer to run the SV annotation methods from scratch, you will need to uncomment the sections of the analysis where they are run.

Because several methods are run, most of the paths in the code are hardcoded. You can either replace them with paths of your choice or, perhaps preferably, create the path /data/andrewsharo-S/thesis/Aim1 on your own system and put all files in that folder. Additionally, all figures are written to /h/andrewsharo/JupyterProjects/Thesis/Aim1/SVClassifier/Figs/, so it may be easiest to create that path on your system as well.
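
One way to set up the two hardcoded locations is sketched below. The helper `ensure_dirs` and its `root` argument are hypothetical conveniences, not part of the StrVCTVRE code; `root` only exists so the layout can be tried in a scratch directory first. Creating the real paths under "/" may require elevated permissions on your system.

```python
import os

# The two locations named above; both are hardcoded in the notebooks.
DATA_DIR = "/data/andrewsharo-S/thesis/Aim1"
FIGS_DIR = "/h/andrewsharo/JupyterProjects/Thesis/Aim1/SVClassifier/Figs/"


def ensure_dirs(paths, root="/"):
    """Create each absolute path under `root` if it is missing.

    `root` is a hypothetical convenience for trying the layout in a
    scratch location; leave it as "/" to create the real paths.
    Returns the list of directories that now exist.
    """
    created = []
    for path in paths:
        target = os.path.join(root, path.strip("/"))
        if not os.path.isdir(target):
            os.makedirs(target)  # may need elevated permissions under "/"
        created.append(target)
    return created
```

For example, `ensure_dirs([DATA_DIR, FIGS_DIR])` creates both folders, after which the extracted data files can be placed in `DATA_DIR`.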

To recreate the entire manuscript, the notebooks must be run in the following order:
DataSelectionFinal.ipynb
MultipleTrainingSets-ClinVarOnly.ipynb
MultipleTrainingSets-5Categories.ipynb
CorrelationMatrix-v2.ipynb
RawTestOnChrs1357.ipynb
GeneDistributionAnalysis-v4(2020ClinVar)-ForShamil-OneVarPerGene.ipynb
MultipleTrainingSets-ClinVarOnly-geneNormalizedOneVarPerGene.ipynb
MultipleTrainingSets-5Categories-rawTestOnNormalizedOneVarPerGene-chrs1357.ipynb
CompareToSVScoreAndVeP-v3-ClinVar2020-chrs1357-OneVarPerGene.ipynb
CompareToSVScoreAndVeP-v3-ClinVar2020-chrs1357.ipynb
MultipleTrainingSets-5Categories-ClinVar vs All Data.ipynb
MultipleTrainingSets-5Categories-AllChrs-OutputRFs.ipynb
readDecipherVariants-HighContribution-ForShamil.ipynb (Decipher data must be requested)
MultipleTrainingSets-5Categories-AllChrs-TestDecipher.ipynb (Decipher data must be requested)
CompareToAnnotSV.ipynb
DominantRecessiveModel-ForShamil.ipynb
MultipleTrainingSets-ClinVarOnly-SizeDistributionFixed-DomRec-ForShamil.ipynb
MultipleTrainingSets-ClinVarOnly-SizeDistributionFixed-DomRec-ForShamil-ConfidentGenesOnly.ipynb
MultipleTrainingSets-5Categories-DomRec-ForShamil.ipynb
MultipleTrainingSets-5Categories-DomRec-ForShamil-OnlyConfidentGenes.ipynb
MultipleTrainingSets-5Categories-DomRec-ForShamil-UnifiedFigure.ipynb
AnalyzeODonnellData-v2.ipynb (full SVs must be requested from author)
GeneDistributionAnalysis-v2.ipynb
ThousandGenomeAnalysis.ipynb
Conservation Distribution.ipynb
RawTestOnChrs1357NoDedupNoDeCommonNoSizeLimit.ipynb
RawTestOnChrs1357<1Mb.ipynb
RawTestOnChrs1357NoDedupNoDeCommon.ipynb
CompareSVScoreAndStrVCTVRE_Runtime.ipynb
MultipleTrainingSets-5Categories-geneNormalized-testonRaw-Reviewers.ipynb
MultipleTrainingSets-5Categories-ClinVar vs All Data-forReviewer.ipynb
Compare To CADD-SV and XCNV chrs1357.ipynb
MultipleTrainingSets-5Categories-ClinVar vs All Data-CADD-SV and X-CNV.ipynb
readDecipherVariants-HighContribution-ForShamil-XCNV-only.ipynb
readDecipherVariants-HighContribution-ForShamil-CADDSV.ipynb
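
The run order above can also be followed without opening each notebook by hand. The sketch below assumes that `jupyter nbconvert` is installed alongside Jupyter and uses its `--execute` option; this is one possible approach, not necessarily how the notebooks were originally run. The function names are hypothetical, and only the first three notebooks are listed as an illustration.

```python
import subprocess

# First entries of the run order listed above; extend with the remaining
# notebooks, in the same order, to cover the whole manuscript.
NOTEBOOKS_IN_ORDER = [
    "DataSelectionFinal.ipynb",
    "MultipleTrainingSets-ClinVarOnly.ipynb",
    "MultipleTrainingSets-5Categories.ipynb",
]


def build_command(notebook):
    """Return the nbconvert argument list that executes one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        "--ExecutePreprocessor.timeout=-1",  # no per-cell time limit
        notebook,
    ]


def run_in_order(notebooks):
    """Execute notebooks one at a time, stopping at the first failure."""
    for notebook in notebooks:
        subprocess.check_call(build_command(notebook))
```

Passing the command as a list rather than a shell string matters here because several notebook names contain spaces, parentheses, or `<`, which a shell would otherwise interpret. Notebooks that depend on Decipher data or on SVs requested from the author will fail unless those files have been obtained first.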

Here is a list of figures from the Sharo et al. manuscript mapped to their corresponding notebook:
1a: CorrelationMatrix-v2.ipynb
1b: MultipleTrainingSets-5Categories.ipynb
2: MultipleTrainingSets-5Categories-ClinVar vs All Data.ipynb
3: MultipleTrainingSets-5Categories.ipynb
4: readDecipherVariants-HighContribution-ForShamil.ipynb (Decipher data must be requested)
5a: ThousandGenomeAnalysis.ipynb
5b: readDecipherVariants-HighContribution-ForShamil.ipynb (Decipher data must be requested)
S1: Conservation Distribution.ipynb
S2: GeneDistributionAnalysis-v2.ipynb
S3: DataSelectionFinal.ipynb
S4: MultipleTrainingSets-5Categories.ipynb
S5: CompareToSVScoreAndVeP-v3-ClinVar2020-chrs1357.ipynb
S6: MultipleTrainingSets-5Categories-ClinVar vs All Data-CADD-SV and X-CNV.ipynb
S7a: RawTestOnChrs1357NoDedupNoDeCommon.ipynb
S7b: RawTestOnChrs1357NoDedupNoDeCommonNoSizeLimit.ipynb
S7c: RawTestOnChrs1357<1Mb.ipynb
S8: MultipleTrainingSets-5Categories-ClinVar vs All Data-forReviewer.ipynb
S9a: readDecipherVariants-HighContribution-ForShamil-XCNV-only.ipynb
S9b: readDecipherVariants-HighContribution-ForShamil-CADDSV.ipynb
S10a: readDecipherVariants-HighContribution-ForShamil-XCNV-only.ipynb
S10b: readDecipherVariants-HighContribution-ForShamil-CADDSV.ipynb
S11: CompareToAnnotSV.ipynb
S12: CompareSVScoreAndStrVCTVRE_Runtime.ipynb
S13: DominantRecessiveModel-ForShamil.ipynb
S14: MultipleTrainingSets-5Categories-DomRec-ForShamil-UnifiedFigure.ipynb

To retrain StrVCTVRE on updated structural variants, run the following notebooks:
DataSelectionFinal.ipynb - provide new structural variants to this notebook.
MultipleTrainingSets-5Categories-AllChrs-OutputRFs.ipynb - The output of this notebook should replace the existing trained rf files.

List of other findings in the paper mapped to corresponding notebook:
All numbers relating to training data: DataSelectionFinal.ipynb
Matching algorithm for pathogenic and benign variants: MultipleTrainingSets-ClinVarOnly.ipynb, MultipleTrainingSets-5Categories.ipynb
Performance on recently discovered SVs from undiagnosed neuromuscular and retinal disorder cases: AnalyzeODonnellData-v2.ipynb (full SVs must be requested from author)
1000 Genomes Project performance: ThousandGenomeAnalysis.ipynb
Determination of StrVCTVRE Odds Pathogenicity cut-off: MultipleTrainingSets-5Categories-AllChrs-TestDecipher.ipynb

Data files present in the directory:
20191219_dbVar_pathogenic_NR_SV.formatted.sorted.bed - SVs trained on by AnnotSV
ALL.wgs.mergedSV.v8.20130502.svs.genotypes.END.only.SVScored.vcf - subset of 1KGP SVs, only dels and dups
ALL.wgs.mergedSV.v8.20130502.svs.genotypes.GRCh38.vcf.gz - all of 1KGP SVs
clinvar_atleast2_2plus_AD_fromShamil.tsv - list of confidently dominant genes
clinvar_atleast2_2plus_AR_fromShamil.tsv - list of confidently recessive genes
diagnoses_for_Andrew.tsv - public subset of pathogenic CMG SVs
exons_Appris_featurized_transcript_Chr1-Y_loeuf_domino_arad.bed - annotated list of exons
exons_Appris_featurized_transcript_Chr1-Y_loeuf_domino.bed - annotated list of exons
exons_Appris_featurized_transcript_Chr1-Y_loeuf.sorted.bed - annotated list of exons
exons_Appris_featurized_transcript_Chr1-Y.sorted.bed - annotated list of exons
exons_CDS_Chr1-Y.sorted.bed - annotated list of exons
fileList.txt - list of files in directory
forSVScore.2020.AllChrs.header.vcf - header file
forSVScore.header.scored.20200302.tsv - output file from SVScore
forSVScore.header.scored.20200302.vcf - output file from SVScore
forSVScore.header.scored.2020.oneVarPerGene.tsv - output file from SVScore
forSVScore.header.scored.2020.vcf - output file from SVScore
forSVScore.header.vcf - input file for SVScore
forSVScore.vcf - input file for SVScore
forVeP.GRCh38.header.vcf - input file for VEP
forVeP.GRCh38.header.vep.vcf - output file from VEP
forVeP.GRCh38.vcf - input file for VEP
fromLO.bed - liftover output file
fromLO.err - liftover error file
fromLO.nocomments.err - modified liftover error file
genes_Appris_Chr1-Y.bed - annotated list of genes
genes_Appris_Chr1-Y.sorted.bed - annotated list of genes
gnomad_v2_sv.sites.hg19.full.filtered.header.bed - subset of gnomAD SVs
gnomad_v2_sv.sites.hg38.bed - full gnomAD SVs
GRCh38_hg38_variants_2016-08-31.txt - Database of Genomic Variants SVs
great_ape_sv_gl_and_grch38b_EVA_gs.vcf.gz - Great Ape SVs
h - header file
hg19ToHg38.over.chain.gz - liftover file
hg38chromsizes.tsv - list of chromosomes ranked by size
hg38.phastCons100way.bw - phastCons scores
hg38.phyloP100way.bw - phyloP scores
hg38ToHg19.over.chain.gz - liftover file
liftOver - liftover program
path.header - header file
rep12tadsMergedhg38.bed - TADs used to annotate SVs
score_all_final_19.02.19.txt - Domino (dominant gene predictor) scores
summary_exon_usage_hg38.sorted.bed - exons annotated by exon usage and expression
variant_summary.txt.gz - ClinVar SVs
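
After downloading StrVCTVRE_source_data.tar.gz, it can be useful to confirm that the files listed above are actually in the archive. The following is a minimal sketch using only the standard library; the function names are hypothetical, and because the internal folder layout of the archive is not described here, files are compared by base name only.

```python
import os
import tarfile


def archive_basenames(archive_path):
    """Return the set of file basenames stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return set(
            os.path.basename(member.name)
            for member in archive.getmembers()
            if member.isfile()
        )


def missing_from_archive(archive_path, expected_names):
    """Return expected file names that the archive does not contain, sorted."""
    return sorted(set(expected_names) - archive_basenames(archive_path))
```

For example, `missing_from_archive("StrVCTVRE_source_data.tar.gz", ["liftOver", "variant_summary.txt.gz"])` returns an empty list if both files are present. Listing a compressed archive of this size means reading through the whole file, so expect the check to take some time.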

Please direct all questions to sharo@berkeley.edu or brenner@compbio.berkeley.edu.
