DoubleChEC program to identify transcription factor binding sites from mapped ChEC-seq data

Name: DoubleChEC program to identify transcription factor binding sites from mapped ChEC-seq data
Creator: Jason Brickner

Brickner, Jason 1

Research facility: Northwestern University

Published Dec 22, 2023 on Dryad. https://doi.org/10.5061/dryad.c866t1gd5

Abstract

ChIP-seq (chromatin immunoprecipitation followed by sequencing) is commonly used to identify genome-wide protein-DNA interactions. However, ChIP-seq often gives a low yield, which is not ideal for quantitative outcomes. An alternative method to ChIP-seq is ChEC-seq (Chromatin endogenous cleavage with high-throughput sequencing). In this method, the endogenous TF (transcription factor) of interest is fused with MNase (micrococcal nuclease) that non-specifically cleaves DNA near binding sites. Compared to the original ChEC-seq method, the modified version requires far less amplification. Since MACS3 failed to identify peaks in data generated from the modified ChEC-seq method, a new peak finder has been developed specifically for it.

There are three functions in the peak_finder/. callpeaks() is used to identify peaks from BAM files. goanalysis() is used to make GO (Gene Ontology) term plots from peaks. bedtomeme() is a wrapper function to perform MEME analysis in R after MEME Suite is installed locally.

Introduction

ChIP-seq (chromatin immunoprecipitation followed by sequencing) is commonly used to identify genome-wide protein-DNA interactions. However, ChIP-seq often gives a low yield, which is not ideal for quantitative outcomes. An alternative method to ChIP-seq is ChEC-seq (Chromatin endogenous cleavage with high-throughput sequencing). In this method, an endogenous TF (transcription factor) fused to MNase (micrococcal nuclease) cleaves DNA near binding sites. This package is designed to identify high-confidence binding sites from cleavage patterns from ChEC-seq2, a variant form of ChEC-seq.

There are three functions in the peak_finder/. callpeaks() is used to identify peaks from single-end mapped reads input as BAM files. goanalysis() is used to make GO (Gene Ontology) term plots from peaks. bedtomeme() is a wrapper function to perform MEME analysis in R after MEME Suite is installed locally.

Usage

Download the following software and data files.:
- three bam files for Rap1-MN
- three bam files for soluble MNase
- run program.R
- 1a cuts.R
- 1b map cuts.R
- 2 sliding window results treatment CPMn.R
- 3 select local maxima.R
- 4 run DESeq2.R
- 5 ChEC peaks.R
- 6 UAS regions.R
- GO analysis.R
- peak finder.R
- run meme.R
- run program.R
- GO Figure 2 GO term analysis.R
- empty yeast genome coordinates.rds
- yeast upstream regions.rds
- GO SGD_YeastGenesList.csv
- sacCer3.fna
- sacCer3.fna.bfile
- sacCer3.fna.fai
Place all files into a folder called peak_finder. Place the bam files for the three replicates for control soluble MNase (sMN) into a folder within peak_finder called files_control_rap1 and place the bam files for the three replicates for Rap1-MN bam files into a folder called files_treatment_rap1. To run this program with different bam files, substitute them for the bam files provided.
To view data files and scripts:
- csv files: open in Excel or Numbers
- rds and .R files: open in R studio
- fna, fna.fai and fna.bfile: use text editor
- bam files: open with BAMseek. Alternatively:
  - Download samtools (e.g., sudo port install samtools)
  - Use the following command to view BAM files:
    
    samtools view -h yourbamfile less
    
    e.g., samtools view -h B45_JV49_S49_R1_Masked.bam less
Navigate to the peak_finder/ folder and open the run program.R file.
Update the working directory to the location of the peak_finder folder
Load the callpeaks() function with:

source("peak finder.R")

To run peak finder with Rap1 vs sMNase:

callpeaks(folder_treatment = "files_treatment_rap1",
          folder_control = "files_control_rap1",
          outdir = "test_folder"
)

This will verify if the peak finder runs correctly.

Once the function completes execution without any errors, load the goanalysis() function with:

source("GO analysis.R")

and run:

goanalysis(bedfile = "./test_folder/peaks.bed",
           outdir = "test_folder"
           )

to check if a GO term plot is generated.

Lastly, load the bedtomeme() wrapper function with:

source("run meme.R")

and run:

bedtomeme(bed2fasta_filepath = "/opt/local/libexec/meme-5.5.1/bed2fasta",
          fasta_output_filename = "test_folder/fasta4meme.fna",
          genome_file = "sacCer3.fna",
          bed_filepath = "test_folder/peaks.bed",
          meme_filepath = "/opt/local/bin/meme",
          meme_output_folder = "test_folder/meme_output")

to check if MEME results are generated. Since bedtomeme() is a wrapper function, you must have the MEME Suite installed locally prior to using it. See the bedtomeme section below for more details

`callpeaks()`

Use callpeaks() to identify peaks from BAM files. Three output files are generated from the function. First, all the parameter values used to run the algorithm and the number of peaks after each filtering step are documented in a text file. Second, all the peaks identified by the algorithm are stored in a bed file. An Excel file is also created, which includes positions of local maxima, peaks retained after DESeq2, and doublet peaks. If the position of a peak falls within 700bp upstream of a gene, the peak is also annotated with that gene information.

Required arguments

`folder_treatment`:

Specify the folder where the treatment (TF-MNase) BAM files are located. The folder should be placed in the same location as all the scripts.

`folder_control`:

Specify the folder where the control (sMNase) BAM files are located. The folder should be placed in the same location as all the scripts.

Optional arguments/Definitions

`numcores`

Specify the number of CPU cores to use in the mclapply() function. See mclapply() documentation for more details. If Windows OS is detected, only 1 CPU core will be used.

`outdir`

Specify the name of the output folder. The folder will be created, and the peak finder results will be saved within it.

`w`

The sliding window size in base pairs. See Slidingwindow() documentation for more details.

`stepsize`

The step between windows in base pairs. See Slidingwindow() documentation for more details.

`foldchange`

Specify the log foldchange value, which is the same as the lfcThreshold argument in DESeq2. See DESeq2 for more details.

`p_value`

Specify the adjusted p-value, which is used to filter DESeq2 results. See DESeq2 for more details.

`lower_distance`

The minimum distance between doublet peaks.

`upper_distance`

The maximum distance between doublet peaks.

`localmax_output`

If this argument is set as TRUE, a BED file with all identified local maxima will be generated

`peaks_after_solmnase_filter`

If this argument is set as TRUE, a BED file with all the peaks survived sMNase filtering will be generated

`goanalysis()`

Use goanalysis() to generate a GO term plot using a BED file.

Required arguments

`bedfile`

Specify both the location and the name of the BED file.

`outdir`

Specify the name of the folder in which the GO term plot will be saved.

Optional arguments

`fontsize`

Specify the font size of each GO term.

`termlength`

Specify the number of terms that will appear on the GO term plot.

`bedtomeme()`

Use bedtomeme() to perform MEME analysis in R.

Since bedtomeme() is a wrapper function, please follow the steps below to download and install the MEME Suite first prior to using it:

Download the MEME Suite.
Install the MEME Suite by following the instructions. Installation using MacPorts is recommended.
Download the yeast genome.
1. Click “Show/Hide Sequence Databases”
2. Click “fasta_yeast.tgz”
3. Unzip “fasta_yeast.tgz”
4. navigate to the fasta_yeast/UCSCOther/ folder, copy sacCer3.fna, sacCer3.fna.bfile, and sacCer3.fna.fai, then paste them into the location where the peak finder scripts are located.

Required arguments

`bed2fasta_filepath`

Specify the path to the bed2fasta program. This file is usually hidden. Mac users can press shift + command + . to find it.

`fasta_output_filename`

Specify the file path and the name of the FASTA file converted from the BED file.

`genome_file`

Specify the yeast genome file (It’s usually sacCer3.fna).

`bed_filepath`

Specify the file path and the name of the BED file.

`meme_filepath`

Specify the path to the meme program. This file is usually hidden. Mac users can press shift + command + . to find it.

`meme_output_folder`

Specify the name and the file path where the MEME results are saved.

Optional argument

`meme_additional`

Specify any additional arguments you’d like to use with the MEME program.

For example:

bedtomeme(bed2fasta_filepath = "/opt/local/libexec/meme-5.5.1/bed2fasta",
          fasta_output_filename = "test_folder/fasta4meme.fna",
          genome_file = "sacCer3.fna",
          bed_filepath = "test_folder/peaks.bed",
          meme_filepath = "/opt/local/bin/meme",
          meme_additional = "-markov_order 2 -nmotifs 4",
          meme_output_folder = "test_folder/meme_output")

Questions?

If you have any questions, please contact Dr. Jason Brickner (j-brickner@northwestern.edu)

****EXCERPTED FROM BIORXIV PREPRINT; SEE PREPRINT OR PUBLISHED PAPER FOR REFERENCES AND DETAILS****

Yeast strains

All yeast strains were derived from BY4741. A C-terminal micrococcal nuclease fusion was introduced to the protein of interest through transformation and homologous recombination of PCR-amplified DNA. Primers were designed with 50-bp of homology to the 3’ end of the coding sequence of interest. The 3xFLAG-MNase with a KanR marker was amplified from pGZ108 (Zentner et al., 2015) and transformed into BY4741 as previously described. Successful transformation was confirmed by immunoblotting and PCR, followed by sequencing.

Lyophilized DNA oligonucleotides were resuspended in molecular-grade water to a concentration of 100 µM. For ligation, the following pair of oligonucleotides were annealed to produce the Y-adapter: Tn5ME-A (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3’) and Y-Adapt-i5 R (5’-CTGTCTCTTATACACATCTTCATAGTAATCATC-3’). For Tn5 Tagmentation, the following i7 oligonucleotides were annealed: Tn5ME-B (5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-3’) and Tn5MErev, (5’-PO₄-CTGTCTCTTATACACATCT-3’). Pairs of oligonucleotides were annealed as follows: 45 µl of each oligo (100 µM) was combined with 10 µl of 1 M Potassium Acetate, 300 mM HEPES, pH 7.5 in a 0.2 ml PCR tube. In a thermocycler, the mixture was heated to 95˚C for 4 minutes, cooled 1°C/minute until 50°C, incubated at 50°C for 5 minutes, and then cooled 1°C/minute until 4°C. Hybridized oligos were stored in 15 µl aliquots at -20˚C.

Tn5 purification and adapter loading

Tn5 E54K L372P was purified as previously described (Hennig et al., 2017). We found that Tn5 was sufficiently pure following purification on Ni²⁺-chromatography and we therefore omitted the final gel filtration step. Purified Tn5 was aliquoted and stored at -80°C. Optimal Tn5 activity was determined by cleaving genomic DNA and assessing fragmentation using the Femto Pulse (Figure S2d), and resulting DNA libraries were confirmed to be of appropriate length for Illumina Sequencing by TapeStation (Figure S2e).

Tn5 was thawed on ice and 100 µl Tn5 was added to 10 µl i7 (45 µM) in a 1.7 ml tube and mixed by gently pipetting. The mixture was incubated at 23°C, mixing at 350 rpm for 45 minutes. Adapter-loaded Tn5 was stored at -20°C and used within 24 hours.

*************************************************************

Chromatin endogenous cleavage detailed protocol

Chromatin digestion

1. Grow cells in 10ml overnight at 30°C, 200 rpm.

2. Dilute cultures into 50ml media to OD₆₀₀ ~ 0.1.

3. When cultures reach OD₆₀₀ = 0.5 - 0.8, harvest 25 ODs (i.e. 50ml if the OD600 = 0.5) of cells by centrifugation at 2500 x g for 1 minute.

4. Resuspend cells in 1 ml Buffer A and transfer to a 1.5 ml tube.

5. Pellet cells by centrifugation at 2500 x g for 1 minute, remove supernatant.

6. Wash cells 2 x 1 ml Buffer A, removing supernatant.

7. Resuspend cells in 600 µl Buffer A + 0.1% Digitonin.

8. Transfer tube to a 30°C heat block and incubate for 5 minutes.

9. Add 5 µl of 333 mM CaCl₂, mix by inverting, incubate at 30°C for the appropriate cleavge time (determined empirically for each protein).

10. To stop the reaction, remove 200 µl cells and combine with 200 µl 2x Stop Buffer.

11. Add 8 µl Proteinase K (20 µg/µl) and mix.

12. Incubate at 50°C, agitating 800 rpm for 30 minutes in a thermomixer.

13. Remove samples from thermomixer and cool at room temperature for 5 minutes.

14. Add 400 µl Phenol-Chloroform-Isoamyl Alcohol (25:24:1), pH 7.8, mix.

15. Centrifuge at 24,000 x g, 5 minutes.

16. Transfer aqueous phase to a phase-lock tube.

17. Add 200 µl Phenol-Chloroform-Isoamyl Alcohol (25:24:1), pH 7.8.

18. Invert 10x to mix.

19. Centrifuge at 24,000 x g for 5 minutes.

20. Transfer aqueous phase to a tube containing 1 ml 100% Ethanol.

21. Add 2 µl of linear acrylamide (5 µg/µl).

22. Invert 10x to mix.

23. Incubate at -80°C for 30 minutes.

24. Centrifuge at 24,000 x g, 4°C for 10 minutes.

25. Pour off supernatant.

26. Wash DNA pellet in 1 ml of 70% ethanol.

27. Centrifuge at 24,000 x g for 1 minute.

28. Pour off supernatant. Collect residual ethanol by centrifugation and remove by pipetting.

29. Dry DNA pellet until all ethanol had evaporated.

30. Add 58 µl of 10 mM Tris-HCl, pH 8.5 to DNA pellet.

31. Incubate overnight at room temperature.

32. Incubate at 37°C for 30 minutes.

33. Add 2 µl RNase A (10 µg/µl) to DNA.

34. Incubate at 37°C for 30 minutes.

35. Evaluate molecular weight of DNA by gel electrophoresis, 0.8% agarose or TapeStation.

36. Quantify DNA concentration with the Qubit double-stranded DNA, Broad Range Assay.

37. Stored DNA at 4°C until library preparation was performed (up to a month), then stored at -20°C.

Buffer A

15 mM Tris-HCl, pH 7.5

80 mM KCl

0.1 mM EGTA

1.0 mM PMSF

0.5 mM Spermidine

0.2 mM Spermine

-Add 1 EDTA-Free Protease Inhibitor Tab per 50 ml Buffer A (Roche; Sigma # 11873580001)

2x Stop Buffer

400 mM NaCl

20 mM EDTA

4 mM EGTA

******************************************

Bioinformatic analysis

Quality Control, Trimming, and Mapping

Read quality and sequencer performance was evaluated with FASTQC. Reads were adapter and quality trimmed with Trimmomatic (Bolger et al., 2014) using single-end settings. Bases at either end of a read were trimmed if base-call quality was less than 30, and only reads of length ≥25 bp were retained. Trimmed reads were mapped to the Saccharomyces cerevisiae genome (Engel et al., 2013), version R64-4-1 with Bowtie2 (Langmead and Salzberg, 2012)and mapped reads with a MAPQ <10 were removed with Samtools (Li et al., 2009).

DoubleChEC identification of high-confidence TF binding sites

For peak calling analysis, BAM files for three or more biological replicates of the TF-MNase and soluble MNase were read and trimmed to the first base pair. Unnormalized counts and normalized counts per million (CPMn) were tallied for each base pair in the yeast genome and the average CPMn values among replicates were calculated for each position. Next, mean CPMn values were smoothed using a sliding window of 3 and a step size of 2. Windows with CPMn values less than three times the genome average were filtered out. After this filtering, local maxima (windows with values greater than their immediate neighbors) were identified. Unnormalized reads were smoothed, retaining positions that were identified as local maxima, and inputted them in DESeq2 (version 1.36.0) to identify windows with values significantly higher than those in the soluble MNase control. Only TF-MNase peaks with a greater log2-fold change of 1.7 and an adjusted p-value less than 0.0001 over soluble MNase were retained. Finally, the peaks were filtered again to identify doublet peaks that are between 15bp and 50bp apart, which were merged to single peaks.

GO term plot

A list of genes whose 700bp upstream regions overlap with peaks identified by the peak finder was input to enrichGO (Wu et al., 2021) to generate GO term plots based on biological functions. The 10 most significant GO terms with adjusted p-values less than 0.05 were plotted.

MEME analyses

The MEME Suite (version 5.5.1) was installed onto the local computer and two custom wrapper functions were written in R for the local bed2fasta and meme programs. These functions were then used to convert bed files, generated from peak calling, into FASTA files. These FASTA files were subsequently to generate motif logos. Both bed2fasta and meme programs were run using their default parameter values.

DoubleChEC program to identify transcription factor binding sites from mapped ChEC-seq data

Data files

Abstract

README: DoubleChEC TF binding site finder

Introduction

Usage

callpeaks()

Required arguments

folder_treatment:

folder_control:

Optional arguments/Definitions

numcores

outdir

w

stepsize

foldchange

p_value

lower_distance

upper_distance

localmax_output

peaks_after_solmnase_filter

goanalysis()

Required arguments

bedfile

outdir

Optional arguments

fontsize

termlength

bedtomeme()

Required arguments

bed2fasta_filepath

fasta_output_filename

genome_file

bed_filepath

meme_filepath

meme_output_folder

Optional argument

meme_additional

Questions?

Methods

Bioinformatic analysis

Quality Control, Trimming, and Mapping

`callpeaks()`

`folder_treatment`:

`folder_control`:

`numcores`

`outdir`

`w`

`stepsize`

`foldchange`

`p_value`

`lower_distance`

`upper_distance`

`localmax_output`

`peaks_after_solmnase_filter`

`goanalysis()`

`bedfile`

`outdir`

`fontsize`

`termlength`

`bedtomeme()`

`bed2fasta_filepath`

`fasta_output_filename`

`genome_file`

`bed_filepath`

`meme_filepath`

`meme_output_folder`

`meme_additional`