Predicting genome-wide tissue-specific enhancers via combinatorial transcription factor genomic occupancy analysis

Abbasi, Amir1; Shireen, Huma1; Khatoon, Hizran1; Batool, Fatima1; Sehar, Noor1; Parveen, Nazia1

Research facility: Comparative and Evolutionary Genomics Lab

Published Oct 07, 2024 on Dryad. https://doi.org/10.5061/dryad.34tmpg4qn

Data files

Oct 07, 2024 version files: 18.15 MB

Human_forebrain_enhancers_coordinates_TFclusters.xlsx (1.82 MB)
Human_forebrain_enhancers_sequences.fasta (16.32 MB)
README.md (6.04 KB)

Abstract

Background

Enhancers belong to the class of non-coding cis-regulatory elements that play a vital role in transcriptional regulation. Mutations in enhancers affect gene regulation and can lead to various disease phenotypes. This has led to an increased interest in identifying enhancers and evaluating the impact of mutations on the enhancers’ activities. However, in contrast to protein-coding intervals, enhancers lack a stereotyped sequence composition. Therefore, the computational prediction of enhancers and their tissue-specificity has remained challenging. Consequently, enhancers are typically predicted based on certain chromatin features, including DNA accessibility, post-translational modifications of histones, and transcription factor (TF) binding. Although these features correlate with enhancer regions, they are only imperfect predictors.

Results

The present study reports a sequence-based computational model that employs combinatorial TF genomic occupancy as the principal determinant to predict tissue-specific enhancers. This model was trained on several data sets, including Encyclopedia of DNA Elements (ENCODE) DNA accessibility data, in vivo experimental data from the VISTA Enhancer Browser, and phylogenetic footprinting of binding motifs. Applying this computational scheme enabled the prediction of 25,000 forebrain-specific cis-regulatory modules (CRMs) in the human genome. These predicted CRMs were validated using ENCODE-based enhancer-associated biochemical features, GWAS-based disease-associated SNPs, and in vivo analysis in zebrafish.

Conclusion

Validation data revealed that this new computational model is suitable for predicting less well-conserved tissue-specific enhancer regions that lack characterized chromatin features, and it can therefore complement and facilitate experimental approaches to tissue-specific enhancer discovery.

Title: Predicting genome-wide tissue-specific enhancers via combinatorial transcription factor genomic occupancy analysis

From cellular differentiation to development, the spatiotemporal expression and regulation of tissue-specific genes is mediated by DNA regulatory elements.

Enhancers, also known as distal-acting cis-regulatory modules (CRMs), are defined as cis-acting DNA sequences responsible for increasing the level of gene expression. Their functionality is orientation-independent, and their distance from the target gene body is highly variable. Given the significant role of tissue-specific enhancers in disease and development, there is currently a great deal of interest in annotating the human genome for tissue/cell-type-specific enhancers. However, this poses a serious challenge for several reasons:

Vast Search Space: Enhancers are scattered across the 98.5% of the human genome that is non-protein-coding, creating a vast search space (billions of base pairs).
Variable Positioning: Although enhancers regulate genes in cis, their positions relative to target genes are highly variable—they can be upstream, downstream, or within introns. They may bypass neighboring genes to regulate distant genes.
Complex Gene Regulation: Some enhancers regulate multiple genes, adding further complexity to their annotation.
Lack of a Well-defined Sequence Code: Unlike protein-coding genes with well-defined sequence codes, the sequence code of enhancers, if it exists, remains poorly understood.

Over the last couple of decades, significant progress has been made in enhancer prediction using evolutionary and biochemical methods. However, both approaches have limitations. Evolutionary approaches lack the strength to detect lineage-specific enhancers, while biochemical marks on enhancer-associated chromatin regions are not always deterministic of enhancer function and can occur stochastically.

In the present study, we report a DNA-sequence-based model that utilizes tissue-specific transcription factor (TF) occupancy to enhance the accuracy of enhancer prediction. Our pipeline heavily relies on prior datasets, such as:

A known set of transcription factors (TFs) relevant to the tissue/cell type of interest
Well-characterized binding motifs for the known set of TFs
A benchmark dataset comprising enhancers characterized in vivo or in vitro, relevant to the tissue/cell type of interest

Based on this principle, using the mammalian forebrain as a model organ, we employed this approach and identified approximately 25,000 distinct forebrain enhancers across the non-coding and non-repetitive portions of the human genome.

The functional relevance of these predicted forebrain enhancer datasets (CRMs) was evaluated by intersecting our predictions with:

Active enhancer-related chromatin marks relevant to the brain
DNase hypersensitive (HS) site data from brain cell lines
GWAS-based brain SNPs

Additionally, a subset of our predictions was directly tested through in vivo assays using zebrafish transgenic models. Furthermore, evolutionary conservation depth analysis of our predictions revealed that the majority of our predicted forebrain CRMs (>85%) are conserved only in mammals or primates, highlighting the relevance of our approach in predicting lineage-specific enhancers.

Our integrated approach provides a reliable complement to classical computational methods and epigenomic-based assays for predicting tissue-specific enhancers.

The workflow is inherently generalizable and can be adapted to predict enhancers for any specific tissue type.
The predicted set of 25,000 forebrain cis-regulatory modules can serve as a valuable resource for:

1) Screening brain-relevant pathological and developmental mechanisms.

2) Exploring the gene-regulatory basis of mammalian/primate brain evolution.

This README file describes the dataset, which contains genomic features of approximately 25,000 forebrain cis-regulatory modules (CRMs/enhancers) predicted using the aforementioned workflow/principle.

Files:

There are two files:

1) Human_forebrain_enhancers_coordinates_TFclusters.xlsx:
This file contains a table with the genome-wide catalogue of predicted human forebrain-specific cis-regulatory modules (CRMs), or enhancers.

Variables used:
Crm_id: Column 1 holds the unique identifier assigned to each predicted human forebrain-specific enhancer. The dataset includes a total of 25,000 Crm_ids.
Chromosome: Column 2 lists the name of the chromosome on which each predicted CRM is located.
Start_position: Column 3 gives the start position of each predicted enhancer in the human genome (assembly version GRCh37).
End_position: Column 4 gives the end position of each predicted enhancer in the human genome (assembly version GRCh37).
Size in base pairs: Column 5 shows the length of each predicted CRM in base pairs.
Transcription factors: Column 6 lists the clustered transcription factors (TFs) whose binding sites reside in each predicted CRM.
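
As a minimal sketch of how one row of this table might be handled, the Python snippet below converts a record to BED-style coordinates. It assumes that Start_position and End_position are 1-based and inclusive, which is consistent with the FASTA header example in this README (chr1:780200-780944 with DNA length=745). The dictionary keys mirror the variable names listed above, and the function name `row_to_bed` is illustrative rather than part of the dataset.

```python
def row_to_bed(row):
    """Convert one table row to a (chrom, start, end, name) BED-style tuple.

    Assumption: the table uses 1-based, inclusive coordinates.
    """
    start = int(row["Start_position"])
    end = int(row["End_position"])
    size = int(row["Size in base pairs"])
    # Under the 1-based inclusive assumption, size equals end - start + 1.
    if end - start + 1 != size:
        raise ValueError(f"size mismatch for {row['Crm_id']}")
    # BED intervals are 0-based and half-open, so only the start shifts.
    return (row["Chromosome"], start - 1, end, row["Crm_id"])


# Example record built from the header example given in this README.
example_row = {
    "Crm_id": "chr1_crm_1",
    "Chromosome": "chr1",
    "Start_position": 780200,
    "End_position": 780944,
    "Size in base pairs": 745,
}
bed_record = row_to_bed(example_row)
```

The size check doubles as a quick test of the coordinate convention: if it fails on real rows, the 1-based inclusive assumption does not hold and the conversion should be revisited.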

2) Human_forebrain_enhancers_sequences.fasta: This file contains the sequences of the 25,000 predicted human forebrain enhancers; each predicted enhancer is referred to as a CRM.

All sequences in FASTA format are from human genome assembly version GRCh37.
The sequence of each CRM starts with a header line, for example: >GRCh37|chr1:780200-780944|Homo sapiens|chr1_crm_1|forebrain enhancer|DNA length=745
Here, the ">" symbol marks the start of a new sequence.
GRCh37 is the version of the human genome assembly used.
The genome assembly is followed by the genomic location and the name of the species to which each CRM belongs.
After the species name comes the "crm_id", a unique identifier assigned to each predicted CRM.
Next to the crm_id is the name of the tissue to which the CRM belongs; in the current study, it is forebrain.
At the end of the header line is the DNA length, which gives the size of each CRM in base pairs.
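
The header layout described above can be parsed with a few string splits. The following Python sketch assumes every header has exactly six pipe-separated fields in the order shown in the example; the function name `parse_crm_header` and the dictionary keys are illustrative choices, not part of the dataset.

```python
def parse_crm_header(line):
    """Split a CRM FASTA header line into its six pipe-separated fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Assumed field order: assembly | location | species | crm_id | tissue | length
    assembly, location, species, crm_id, tissue, length_field = (
        line[1:].strip().split("|")
    )
    chrom, span = location.split(":")
    start, end = (int(value) for value in span.split("-"))
    length = int(length_field.split("=")[1])
    return {
        "assembly": assembly,
        "chrom": chrom,
        "start": start,
        "end": end,
        "species": species,
        "crm_id": crm_id,
        "tissue": tissue,
        "length": length,
    }


header = ">GRCh37|chr1:780200-780944|Homo sapiens|chr1_crm_1|forebrain enhancer|DNA length=745"
record = parse_crm_header(header)
```

For this example, `record["end"] - record["start"] + 1` equals `record["length"]`, which suggests the coordinates in the header are 1-based and inclusive.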

Based on the heterotypic cooperativity of transcription factors (TFs), this work presents a high-throughput workflow for predicting tissue-specific enhancers at a genome-wide scale. The study comprises two phases.

Phase 1: Prediction of key TFs likely to have a role in forebrain development.

Phase 2: Devising a pipeline for the prediction of tissue-specific enhancers that exploits the heterotypic cooperativity among TFs (curated in phase 1).

In phase 1, we aimed to pinpoint a set of TFs that show combinatorial genomic occupancy and play a significant role in human forebrain development and differentiation. As an initial step, we manually curated a library of 93 TFs relevant to human forebrain tissue through an extensive literature survey. This library was then subjected to two different strategies: (I) motif discovery through statistical over-representation combined with continuous tag sequence density estimation on a DNase I hypersensitive site (DHS) map, and (II) a regular-expression-based algorithm.

In strategy I, the first step was to computationally predict candidate transcription factor binding sites (TFBSs) for the 93 TFs using two programs, CLOVER and F-seq. CLOVER screens a set of DNA sequences against a precompiled library of motifs and selects the motifs that are statistically over-represented in the sequences. With CLOVER, the binding profiles of the 93 TFs (collected from JASPAR) were searched against 104 forebrain-specific human enhancers (FSHEs; positive control) from the VISTA Enhancer Browser and a set of 100 human non-coding non-conserved sequences (NCNCSs; negative control) from the UCSC Genome Browser. F-seq is a software package that generates a continuous tag sequence density estimate to predict binding sites. With F-seq, DHS data for three independent cell lines from ENCODE, namely GM12878 (lymphocytes), Cerebrum_Frontal_OC (cerebrum), and Frontal_Cortex_OC (cortex), were used to predict DNase I hypersensitive regions of chromatin on the FSHEs (positive control) and NCNCSs (negative control). The binding sites predicted by CLOVER were then overlapped with the regions predicted by F-seq to improve the accuracy of TFBS prediction. Finally, principal component analysis (PCA) was applied to identify clusters of co-occurring TFBSs in human forebrain-specific enhancers.
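
The overlap step between motif-based sites and accessible regions can be sketched as a simple interval intersection. The Python example below is an illustration only, not the authors' implementation: it assumes half-open coordinates on a single sequence, and the function name, TF labels, and toy coordinates are all hypothetical.

```python
def sites_in_open_chromatin(sites, open_regions):
    """Keep predicted motif sites that overlap at least one accessible region.

    sites: list of (start, end, tf_name) tuples.
    open_regions: list of (start, end) tuples.
    All coordinates are assumed half-open and on the same sequence.
    """
    kept = []
    for start, end, tf_name in sites:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < r_end and r_start < end for r_start, r_end in open_regions):
            kept.append((start, end, tf_name))
    return kept


# Toy input with made-up coordinates and placeholder TF labels.
motif_sites = [(10, 18, "TF_A"), (120, 128, "TF_B"), (300, 308, "TF_A")]
accessible = [(0, 50), (290, 400)]
supported_sites = sites_in_open_chromatin(motif_sites, accessible)
```

This linear scan is adequate for a few hundred enhancer-sized sequences; genome-scale inputs would normally use sorted intervals or an interval tree instead.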

In the second strategy, we used a phylogenetic footprinting approach to predict the evolutionarily conserved binding sites of the 93 forebrain TFs (curated from the literature) in FSHEs (positive control) and NCNCSs (negative control). For this purpose, we designed a regular-expression-based algorithm that aligns the human and mouse orthologous sequences of the FSHEs and NCNCSs and pinpoints TRANSFAC-based binding sites (for the 93 forebrain TFs) that are conserved between the human and mouse orthologues. The evolutionarily conserved predicted binding sites were then subjected to PCA to identify the clusters of conserved TFBSs in each of the FSHEs and NCNCSs. The PCA results from strategy I and strategy II were compared, and the 23 TFs common to both were shortlisted for further analyses. These 23 TFs were then used to predict clusters of co-occurring motifs that might serve as forebrain cis-regulatory modules. For this purpose, we used three human genomic regions enriched in protein-coding genes related to forebrain development or expression. These genomic segments were scanned for the 23 TFs using a purpose-built sliding window brute force search algorithm (SDBFSA). We found that a subset of 6 of the 93 TFs, namely FOXP2, OTX1, OTX2, GATA3, HES5 and NGN2, share relatively higher heterotypic as well as homotypic binding site preferences.
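
The idea behind regular-expression-based phylogenetic footprinting can be illustrated with a short Python sketch. It is not the authors' algorithm: it assumes the human and mouse sequences are already aligned in a gap-free block of equal length, and the consensus string and toy sequences are invented for illustration rather than taken from TRANSFAC.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a lookahead regex (finds overlapping hits)."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def conserved_sites(human_block, mouse_block, consensus):
    """Return offsets where the motif matches at the same position in both species."""
    if len(human_block) != len(mouse_block):
        raise ValueError("aligned blocks must have equal length")
    pattern = consensus_to_regex(consensus)
    human_hits = {m.start() for m in pattern.finditer(human_block.upper())}
    mouse_hits = {m.start() for m in pattern.finditer(mouse_block.upper())}
    return sorted(human_hits & mouse_hits)


# Hypothetical consensus and toy aligned sequences.
human = "GGTAATCCAAGGTAATCTGG"
mouse = "GCTAATCTAAGGTTATCTGG"
shared = conserved_sites(human, mouse, "TAATCY")
```

In this toy case the human sequence carries two matches (offsets 2 and 12) but only the first is also present in mouse, so only offset 2 is reported as conserved. Handling alignments with gaps would require mapping alignment columns back to sequence positions, which this sketch omits.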

These six TFs were carried into phase 2, where we applied the combinatorial code to predict clusters of their binding sites at a genome-wide scale. For this purpose, a specialized Perl script was designed and applied to the repeat- and exon-masked sequence of the human genome (GRCh37). This identified clusters of TFBSs that were closely spaced, at a spacer distance of 250 bp. Each identified cluster contained binding sites for at least 5 of the 6 distinct TFs. We termed each such cluster an independent cis-regulatory module (CRM) and assigned it a unique identifier called “crm_id”. As a result, we compiled a catalogue of 25,000 forebrain-specific CRMs composed of concentrated clusters of recognition motifs for the 6 core TFs (details are given in the dataset and README file). These predictions were validated using ENCODE-based enhancer-associated biochemical features, GWAS-based disease- or trait-associated SNPs, and in vivo functional analysis in zebrafish. Our computational workflow is suitable for predicting less well-conserved tissue-specific enhancers that lack characterized chromatin features, and it can therefore complement and facilitate experimental approaches to tissue-specific enhancer discovery.
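
The clustering rule described above can be sketched in Python as follows. This is not the authors' Perl script; it is a simplified illustration that interprets the 250 bp spacer as the maximum allowed distance between consecutive binding-site positions, which is an assumption. The function name, parameters, and hit positions are hypothetical; only the six TF names come from the text.

```python
def find_clusters(hits, max_gap=250, min_distinct_tfs=5):
    """Group binding-site hits into candidate CRMs on one chromosome.

    hits: list of (position, tf_name) tuples.
    Consecutive hits no more than max_gap apart are chained into one group
    (assumed reading of the 250 bp spacer); groups with fewer than
    min_distinct_tfs different TFs are discarded.
    Returns a list of (first_position, last_position, sorted_tf_names).
    """
    groups, current = [], []
    for position, tf_name in sorted(hits):
        if current and position - current[-1][0] > max_gap:
            groups.append(current)
            current = []
        current.append((position, tf_name))
    if current:
        groups.append(current)

    clusters = []
    for group in groups:
        tf_names = sorted({tf_name for _, tf_name in group})
        if len(tf_names) >= min_distinct_tfs:
            clusters.append((group[0][0], group[-1][0], tf_names))
    return clusters


# Toy hits with made-up positions.
toy_hits = [
    (1000, "FOXP2"), (1100, "OTX1"), (1300, "OTX2"),
    (1450, "GATA3"), (1600, "HES5"),
    (5000, "NGN2"), (5100, "FOXP2"),
]
candidate_crms = find_clusters(toy_hits)
```

Here the first five hits chain into one group with five distinct TFs and are kept, while the two hits near position 5000 form a group with only two TFs and are dropped.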
