Workflow of the system for CRISPR Outcome and Risk Evaluation (SCORE)

Li, Shengdi 1 2 ; Vonesch, Sibylle3 2; Roy, Kevin1; Szu Tu, Chelsea2; Steudle, Friederike2; Nguyen, Michelle1; Jann, Cosimo2; Steinmetz, Lars1 2

Published Sep 24, 2025 on Dryad. https://doi.org/10.5061/dryad.mw6m9068k

Data files

Sep 24, 2025 version files 12.31 MB

MAGESTIC-SCORE-main.zip

12.30 MB
README.md

11.92 KB

Abstract

It is unclear how CRISPR editing outcomes vary across the genome and whether undesirable events such as structural variants (SVs) are predictable or preventable. Here, we describe a computational workflow to process whole-genome sequencing (WGS) data generated for multiplexed CRISPR/Cas genome editing experiments. The workflow characterizes and classifies diverse editing outcomes resulting from CRISPR and trains a predictive model based on a machine-learning-based framework termed SCORE (System for CRISPR Outcome and Risk Evaluation).

A genome-wide annotation of difficult-to-edit regions based on MAGESTIC and SCORE

Install conda environment

a. Install conda environment using CONDA (replace {myenv_name} with the name of environment to create):

git clone https://github.com/shli-embl/MAGESTIC-SCORE.git
cd MAGESTIC-SCORE/
conda env create -f envs/environment.yaml -p envs/{myenv_name}

Or using MAMBA (recommended):

git clone https://github.com/shli-embl/MAGESTIC-SCORE.git
cd MAGESTIC-SCORE/
mamba env create -f envs/environment.yaml -p envs/{myenv_name}

If you are using our "MAGESTIC-SCORE-main.zip" downloaded from DRYAD, unzip it first and then do:

cd MAGESTIC-SCORE/
mamba env create -f envs/environment.yaml -p envs/{myenv_name}

b. Then, activate the environment before running the snakemake pipelines:

conda activate envs/{myenv_name}

Snakemake pipelines

Pipeline 1: Single clonal whole-genome sequencing data analysis and editing outcome (pre-)calling

The single clonal WGS analysis pipeline takes in .fastq read files from multiple Cas9 edited strains isolated from cell pool created by MAGESTIC, so that each single genome contains a barcode linked to a pre-designed gRNA-donor-DNA pair to introduce a variant on the target locus. The pipeline contains read processing and on-target genotyping (both for SNV/indel detection via GATK4 and SV detection via a 3-in-1 pre-selection approach). The SV pre-scan includes GATK4 (to detect unintended SNV/indels proximate to the target), SvABA and CNV detection. We also extended the outcome analysis to a list of off-target loci predicted by edit-distance from the on-target sequence.

1a. Modify the configuration files

cd 1_outcome_precalling
ls inputs/

file 1. gRNA_list.txt (sample id and gRNA sequence):

MAGESTIC_REDI_1	CTCCTCAGTTTGTCCACGGT
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAG
MAGESTIC_REDI_100	ATACTGGCCACGTTTGACAA
MAGESTIC_REDI_1000	TTGTGATTTTATTGATTCTG
MAGESTIC_REDI_1001	GGTAACAAAGTCACGGCTCC
...

file 2. input.yaml (required by snakemake for locating file paths):

---
samplesheet: 'inputs/samples.txt'
grna: 'inputs/gRNA_list.txt'
genome: 'inputs/fasta/yeast_reference.fa'
tmp_dir: './tmpdir'
output_dir: './output'
offtar: 'inputs/off_target.txt'

file 3. samples.txt (sample information, read file paths, with header):

sample	read1fq	read2fq	chr	strand	PAM_coordinate	variant_coordinate	template_start_coordinate	template_end_coordinate	reference_allele	mutation_allele	template_sequence	target_mutations	synthetic_errors
MAGESTIC_REDI_1	/g/steinmetz/incoming/solexa/2018-10-24-HCW3HAFXY/HCW3HAFXY_CST85_18s005308-1-1_Szu-Tu_lane1012AA1_1_sequence.txt.gz	/g/steinmetz/incoming/solexa/2018-10-24-HCW3HAFXY/HCW3HAFXY_CST85_18s005308-1-1_Szu-Tu_lane1012AA1_2_sequence.txt.gz	chr8	-	104423	104423	104371	104479	CCCA	ACCC	TGGCTGCTATCGTTGAAATTATCGACCAAAAGAAGGTATGTTGAACCTAAAAACCCCCGTGGACAAACTGAGGAGGAAATTGTAAGGAAGAGAAAGTCCCCGTATGTTC	chr8_104423_C_to_A,chr8_104426_A_to_C
...

Note: template_sequence and synthetic_errors were obtained from REDI sequencing of the barcode locus prior to the WGS.
file 4. off_target.txt (list of off-target location predicted from edit distance <= 3):

sample	guide	chr	PAM_coordinate	match	strand	mismatch
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAGNNN	chr4	1503943	AGAGGAAGTCTCAACaGCAGAGG	+	1
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAGNNN	chr4	1503961	AGAGGAAGTCTCAACaGCAGAGG	+	1
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAGNNN	chr4	1503889	AGAGGAAGTCTCAgCaGCAGAGG	+	2
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAGNNN	chr4	1503925	AGAaGAAGTCTCAACaGCAGAGG	+	2
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAGNNN	chr4	1503979	AGAGGAAGTCTCAgCaGCAGAGG	+	2
MAGESTIC_REDI_10	AGAGGAAGTCTCAACGGCAGNNN	chr4	1503853	AGAaGAAaTCTCAACaGCAGAGG	+	3
...

1b. Run pipeline

We use a HPC cluster for computing the task, for which the configuration of job submission and control are saved in profile/config.yaml.

cd 1_outcome_precalling
snakemake --profile profile

Pipeline 2: Target site sequence and genomic feature annotations

For a given list of PAM sites (and gRNA sequences), our pipeline 2 outputs a full list of annotations that are used for constructing the predictive model used in Pipeline 3. These annotations includes target sequence features, chromatin accessibility, histone modification, as well as the customized sequence repetitiveness metrics (and more details about feature list can be found in our publication).

2a. Prepare annotation files

Run the script to download pre-processed annotation files for the SCORE model (yeast chromatin, histone modification data). Please find the information about how the processed data were generated from 2_target_list_annotation/readmeplease.txt.

cd 2_target_list_annotation/
sh download_annotation.sh

2b. Modify the configuration files

ls inputs/

file 1. input.yaml (required by snakemake for locating file paths):

---
targetsheet: 'inputs/targets.txt'
genome: 'inputs/fasta/yeast_reference.fa'
tmp_dir: './tmpdir'
output_dir: './output'
atac_seq_bed: 'inputs/data/lin_all.merge.ins.bedgraph'
centromere_coords: 'inputs/data/yeast_CEN.coords'
transcription_forward_strand: 'inputs/data/WT.polyA.forward.depth'
transcription_reverse_strand: 'inputs/data/WT.polyA.reverse.depth'
hist_mod_list: 'inputs/data/histone_modification_binned_bed_files.txt'

file 2. targets.txt (list of samples with corresponding gRNA targets):

sample	chr	strand	coord	gRNA_sequence
MAGESTIC_REDI_1	chr8	-	104423	CTCCTCAGTTTGTCCACGGT
MAGESTIC_REDI_10	chr4	+	1503892	AGAGGAAGTCTCAACGGCAG
MAGESTIC_REDI_100	chr11	+	220649	ATACTGGCCACGTTTGACAA
MAGESTIC_REDI_1000	chr10	+	92631	TTGTGATTTTATTGATTCTG
MAGESTIC_REDI_1001	chr6	+	20768	GGTAACAAAGTCACGGCTCC

2c. Run pipeline

We use a HPC cluster for computing the task, for which the configuration of job submission and control are saved in profile/config.yaml.

cd 2_target_list_annotation/
snakemake --profile profile

Pipeline 3: Machine learning model construction and tuning

We use the R caret package to perform model searching across a number of hyperparameters: 1) ML algorithms (gradient-boosted trees, random forest, L1- or L2-regularized regressions; 2) Hyperparamters within each ML algorithm (e.g. tree number ...); 3) Feature subsets (target sequence, chromatin context and genomic repetitiveness); 4) Data imbalance treatment (over- or down-sampling, with SMOTE or ADASYN). Repeated k-fold cross validation was used to assess the model performance.

3a. Modify the configuration files

file 1. input.yaml (required by snakemake for locating file paths):

---
observed_outcome: 'inputs/edit_outcome.txt'
feature_table: 'inputs/combined_features.txt'
tmp_dir: './tmpdir'
output_dir: './output'

file 2. edit_outcome.txt (measured editing outcomes for samples encoded as 0/1 matrix, where 1 represents a matched unwanted outcome):

*Note that some sample editing outcomes are indicated as 'NA' referring to unclassified or unknown (e.g., due to low sequencing coverage at the target sites). 'NA's are automatically disregarded and not used for training by our program.

New_id	Editing_outcome_NE	Editing_outcome_SV	Editing_outcome_LD	Editing_outcome_NRecT
MAGESTIC_REDI_1	0	0	0	0
MAGESTIC_REDI_10	NA	NA	NA	NA
MAGESTIC_REDI_100	NA	NA	NA	NA
MAGESTIC_REDI_1000	1	0	0	0
MAGESTIC_REDI_1001	1	0	0	0
MAGESTIC_REDI_1002	0	0	0	0
MAGESTIC_REDI_1003	0	0	0	0

file 3. combined_features.txt (feature table obtained from Pipeline 2, with additional columns in the end to assign editing system and variant type as correction factors):

id	ATACseq_ins_freq	bp_to_telomere	bp_to_centromere	norm_distance_to_telomere	H2AK5ac	H2AS129ph	H3K4ac	H3K4me	H3K4me2	H3K4me3	H3K9ac	H3K14ac	H3K18ac	H3K23ac	H3K27ac	H3K36meH3K36me2	H3K36me3	H3K56ac	H3K79me	H3K79me3	H3S10ph	H4K5ac	H4K8ac	H4K12ac	H4K15acH4K20me	H4R3me	H4R3me2s	HTZ1	IGR100	IGR1000	IGR10000	IGR50	IGR500	IGR5000	ILR100	ILR1000	ILR10000	ILR50	ILR500	ILR5000	m0	m1	m2	m3	T_score	target_sequence_30mer	Transcription_Cas9_bind_template_strand	Transcription_Cas9_bind_non_template_strand	TRL2500x2	TRL250x2	TRL25x2	TRL5000x2	TRL500x2	TRL50x2	Editing_system_1	Editing_system_2	Editing_system_3	variant_type_SNV	variant_type_INDEL	variant_type_MNV
MAGESTIC_REDI_1	3.06	104423	1163	0.989	0.571911931	-1.413944272	1.82410521	-1.913294733	-1.06959095	2.134772687	1.647827909	1.381763726	2.527676605	1.081172126	0.557087315	-0.340625001	-0.460996809	-0.281504667	1.408581249	0.038505213	-0.193029944	0.632719882	0.000316058	-0.412611799	-0.306177367	-0.719997155	-1.018213765	-0.464537889	-1.073701456	-0.6606927	0.005	0.135	0.048	0	0.013	0.086	-0.00110.0166	0.0122	-0.0114	0.0314	0.0089	1	0	0	0	3.5	ATTTCCTCCTCAGTTTGTCCACGGTGGGTT	-5.644	9.1	17	10	0	21	17	6	1	0	0	0	0	1

3b. Run pipeline

We use a HPC cluster for computing the task, for which the configuration of job submission and control are saved in profile/config.yaml.

cd 3_model_training/
snakemake --profile profile

Pipeline 4: Bulk whole-genome sequencing analysis for structral variant characterizations

We used bulk DNA sequencing to infer the composition of SVs vs. normal edits in a list of predicted difficult-to-edit loci in yeast genome. Pipeline 4 composes of the processing of raw .fastq reads and the quantification of coverage depth in the regions of interest.

4a. Modify the configuration files

file 1. samples.txt (list of individual bulk WGS sample):

sample	editing_system	read1fq	read2fq	variant	variant_upstream	variant_downstream	known_outcome	guide_donor_plasmid	guide_donor_plasmid_version	editing_system_plasmid	barcoding_locusplot_window
locus_LD_1_Cas9_LexA_rep_1	Cas9_LexA	/g/steinmetz/project/yeast_crispr/REDI/STF_spike_in_WGS/raw_data/STF_SV_editing_final/20220325_MAGESTIC_3_SV_CSM_HIS_5FC_outgrowth_rep_2_Tn5/fastq/20220422_Tn5_plate_2A_sample_25_R1.fastq.gz;/g/steinmetz/project/yeast_crispr/REDI/STF_spike_in_WGS/raw_data/STF_SV_editing_final/20220325_MAGESTIC_3_SV_CSM_HIS_5FC_outgrowth_rep_2_Tn5_reseq/fastq/20220805_Tn5_plate_2A_sample_25_R1.fastq.gz	/g/steinmetz/project/yeast_crispr/REDI/STF_spike_in_WGS/raw_data/STF_SV_editing_final/20220325_MAGESTIC_3_SV_CSM_HIS_5FC_outgrowth_rep_2_Tn5/fastq/20220422_Tn5_plate_2A_sample_25_R2.fastq.gz;/g/steinmetz/project/yeast_crispr/REDI/STF_spike_in_WGS/raw_data/STF_SV_editing_final/20220325_MAGESTIC_3_SV_CSM_HIS_5FC_outgrowth_rep_2_Tn5_reseq/fastq/20220805_Tn5_plate_2A_sample_25_R2.fastq.gz	chr8_287671_AGGTGG_GAATTC	TACCAACAA	AAGCAAAGT	3832-bp deletion	pS1399	v0	pKR373	yKR889	chr8:283500-293500

file 2. input.yaml (required by snakemake for locating file paths):

---
samplesheet: 'inputs/samples.txt'
genome: 'inputs/fasta/yeast_reference.fa'
guide_donor_plasmids: 'inputs/fasta/all_guide_donor_editing_plasmid.fa'
variants: 'inputs/variants.txt'
tmp_dir: './tmpdir'
region: 'inputs/regions.bed'
output_dir: './output'

file 3. variants.txt (the wild type and mutated genotypes for each variant):

variant	variant_upstream	variant_downstream	plot_window
chr8_287671_AGGTGG_GAATTC	TACCAACAA	AAGCAAAGT	chr8:283500-293500
chr3_249330_GATCCTGA_AATTCGCT	AATCCGCCG	ACCTCCAAA	chr3:249100-249600
chr4_1161885_CCTTGG_GAATTC	AGCGTAGCC	AGAAGCATA	chr4:1157000-1167000
chr7_111607_ACAAGG_GAATTC	AGCAGAAAT	ATTTATAAT	chr7:106000-116000
chr15_1070807_ACCAATG_GCCAATA	AAACATGCG	GGTCAAAAT	chr15:1069800-1071800
chr8_5782_CCCAGA_GAATTC	TGGTGATTA	TGTCATGCA	chr8:0-10000
chr14_751267_TTCTG_AATTC	GACAACCAG	GAACCCAGA	chr14:746000-756000
chr1_27356_TAATA_AATTC	CAGTCCCTG	TTATCAGGG	chr1:22000-32000
chr9_435817_CCTG_GGAT	AATTTTTGG	CCTCACTAT	chr9:429800-439800

4b. Run pipeline

We use a HPC cluster for computing the task, for which the configuration of job submission and control are saved in profile/config.yaml.

cd 4_bulk_wgs_analysis/
snakemake --profile profile

Workflow of the system for CRISPR Outcome and Risk Evaluation (SCORE)

Data files

Abstract

README: MAGESTIC-SCORE

A genome-wide annotation of difficult-to-edit regions based on MAGESTIC and SCORE

Install conda environment

Snakemake pipelines

Pipeline 1: Single clonal whole-genome sequencing data analysis and editing outcome (pre-)calling

1a. Modify the configuration files

1b. Run pipeline

Pipeline 2: Target site sequence and genomic feature annotations

2a. Prepare annotation files

2b. Modify the configuration files

2c. Run pipeline

Pipeline 3: Machine learning model construction and tuning

3a. Modify the configuration files

3b. Run pipeline

Pipeline 4: Bulk whole-genome sequencing analysis for structral variant characterizations

4a. Modify the configuration files

4b. Run pipeline