Data from: Regulatory genome annotation for 33 insect species
Data files
Jul 08, 2024 version files (757.09 MB)
- Asma_2024_bedoutput.tar.gz (104.84 MB)
- Asma_2024_gff_output.tar.gz (652.24 MB)
- README.md (6.03 KB)
Dec 02, 2024 version files (772.80 MB)
- Asma_2024_bedoutput.tar.gz (104.84 MB)
- Asma_2024_gff_output.tar.gz (652.24 MB)
- README.md (6.65 KB)
- trainingsets.zip (15.72 MB)
Abstract
Annotation of newly-sequenced genomes frequently includes genes, but rarely covers important non-coding genomic features such as the cis-regulatory modules (e.g., enhancers and silencers) that regulate gene expression. Here, we begin to remedy this situation by developing a workflow for rapid initial annotation of insect regulatory sequences, and provide a searchable database resource with enhancer predictions for 33 genomes. Using our previously-developed SCRMshaw computational enhancer prediction method, we predict over 2.8 million regulatory sequences along with the tissues where they are expected to be active, in a set of insect species ranging over 360 million years of evolution. Extensive analysis and validation of the data provides several lines of evidence suggesting that we achieve a high true-positive rate for enhancer prediction. One, we show that our predictions target specific loci, rather than random genomic locations. Two, we predict enhancers in orthologous loci across a diverged set of species to a significantly higher degree than random expectation would allow. Three, we demonstrate that our predictions are highly enriched for regions of accessible chromatin. Four, we achieve a validation rate in excess of 70% using in vivo reporter gene assays. As we continue to annotate both new tissues and new species, our regulatory annotation resource will provide a rich source of data for the research community and will have utility for both small-scale (single gene, single species) and large-scale (many genes, many species) studies of gene regulation. In particular, the ability to search for functionally-related regulatory elements in orthologous loci should greatly facilitate studies of enhancer evolution even among distantly related species.
https://doi.org/10.5061/dryad.3j9kd51t0
These data are the results of regulatory sequence prediction on 33 insect genomes, produced using the SCRMshaw pipeline as described in the associated publication. Five sets of files are provided:
(1) The post-processed SCRMshaw output for each genome. These files all begin with “scrmshawOutput” or “SO_scrmshawOutput”; files beginning with “SO” have undergone the orthology-assignment step. This output has not been further processed to merge overlapping predictions or to merge duplicate predictions generated using different training data. These files have the extension “.bed” and are tab-delimited text files that can be opened using any standard text editor.
(2) The prediction data from each genome, with overlapping and/or duplicate predictions reconciled as described in the protocol and converted to GFF format. These files begin with a species designation followed by the descriptor “converted.” These files have the extension “.gff” and are tab-delimited text files that can be opened using any standard text editor.
(3) The “converted” files from above concatenated to the relevant species annotation GFF file (the same file used during SCRMshaw prediction). These files begin with a species designation followed by the descriptor “merged.” These files have the extension “.gff” and are tab-delimited text files that can be opened using any standard text editor.
(4) The “merged” files from above sorted using gff3_toolkit “sort” (Chen et al. 2019, Methods Mol Biol., PMID:30414112). These files begin with a species designation followed by the descriptor “sorted.” These files are not present for all species, and may have missing lines in certain other species. This is because gff3_toolkit works only with files that strictly adhere to the GFFv3 specification. The species-specific GFF files we worked with do not always meet this criterion and thus fail in whole or in part during the sorting process. Going forward we intend to work primarily with genomes and annotations obtained from RefSeq, which should alleviate this issue. These files have the extension “.gff” and are tab-delimited text files that can be opened using any standard text editor.
(5) A .zip file of the training sets used for regulatory sequence prediction. Each training set has its own directory, named with the name of the training set. Inside each training set directory are two files. ‘crms.fasta’ contains the positive training data. ‘neg.fasta’ contains the corresponding negative training data. Each file is in multi-FASTA format. Names of positive training sequences correspond to entries in the REDfly database.
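The naming conventions and training-set layout described above can be sketched in Python as follows. This is a minimal illustration, not part of the published pipeline: the function names (`classify_file`, `read_fasta`, `load_training_set`) are hypothetical, and the exact position of the descriptor within each GFF file name is an assumption, so the classifier only checks that the descriptor appears somewhere in the name.

```python
from pathlib import Path

def classify_file(name):
    """Guess which file set a file belongs to from its name (assumed conventions)."""
    base = Path(name).name
    if base.endswith(".bed"):
        if base.startswith("SO_scrmshawOutput"):
            return "bed_orthology"  # orthology-assignment step has been run
        if base.startswith("scrmshawOutput"):
            return "bed"
    if base.endswith(".gff"):
        # Assumption: the descriptor appears somewhere after the species designation.
        for descriptor in ("converted", "merged", "sorted"):
            if descriptor in base:
                return descriptor
    return "unknown"

def read_fasta(path):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    # Note: a repeated header would overwrite the earlier record.
    return {header: "".join(parts) for header, parts in records.items()}

def load_training_set(directory):
    """Load the positive (crms.fasta) and negative (neg.fasta) sequences of one training set."""
    directory = Path(directory)
    return {
        "positive": read_fasta(directory / "crms.fasta"),
        "negative": read_fasta(directory / "neg.fasta"),
    }
```

For example, after unzipping the training sets, `load_training_set` could be called on any one training-set directory to obtain both sequence collections keyed by sequence name.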
Description of the data and file structure
Additional details about SCRMshaw scores, peak calling, local rank, training sets, and methods can be found in the associated publication.
Post-processed SCRMshaw output (#1, above) is in the form of an 18-column BED-type format organized as follows:
- Chromosome
- Start coordinate
- End coordinate
- Peak amplitude: maximum amplitude of called SCRMshaw peak from MACS2 analysis
- SCRMshaw score: maximum SCRMshaw output score for scored intervals beneath the peak
- Flanking gene
- D. melanogaster ortholog of flanking gene (if the orthology step has been run)
- Distance of hit from flanking gene (basepairs)
- Location of hit relative to flanking gene: e.g., upstream, downstream, inside (intronic)
- Local rank: rank of peak relative to other called peaks for the given training set within 50 kb to each side
- Next closest flanking gene
- D. melanogaster ortholog of next flanking gene (if the orthology step has been run)
- Distance of hit from next flanking gene (basepairs)
- Location of hit relative to next flanking gene: e.g., upstream, downstream, inside (intronic)
- Local rank: rank of peak relative to other called peaks for the given training set within 50 kb to each side
- Training set: name of training data file used to generate these predictions
- Method (hexmcd, imm, pac)
- Rank
If the orthologous gene is not known, it is listed as “No_OrthoPara.” Where predictions are merged, multiple results may be provided in each column, depending on the results of the merge (e.g., for method, “imm, hexmcd”). Peak amplitude, score, and rank will contain the best value from among the merged predictions. “Local rank” is described in [4], although its utility as a metric when using the SCRMshaw_HD post-processing procedure has not been determined.
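As a rough illustration of the 18-column layout above, the following Python sketch turns one tab-delimited line into a dictionary. The column keys in `BED_COLUMNS` are hypothetical labels chosen here, not names used by SCRMshaw. Only the coordinates, amplitude, and score are converted to numbers; the remaining fields are kept as strings because merged files may carry several values in one column.

```python
# Hypothetical labels for the 18 columns, in the order listed above.
BED_COLUMNS = [
    "chrom", "start", "end", "amplitude", "score",
    "flank_gene", "flank_ortholog", "flank_distance", "flank_location",
    "flank_local_rank",
    "next_gene", "next_ortholog", "next_distance", "next_location",
    "next_local_rank",
    "training_set", "method", "rank",
]

def parse_bed_line(line):
    """Parse one line of post-processed SCRMshaw output into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BED_COLUMNS):
        raise ValueError(
            f"expected {len(BED_COLUMNS)} columns, found {len(fields)}"
        )
    record = dict(zip(BED_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Assumption: amplitude and score are always single numeric values.
    record["amplitude"] = float(record["amplitude"])
    record["score"] = float(record["score"])
    return record

def has_known_ortholog(record):
    """True unless the flanking gene's ortholog is the 'No_OrthoPara' placeholder."""
    return record["flank_ortholog"] != "No_OrthoPara"
```

A file could then be read line by line with `parse_bed_line`, filtering on `has_known_ortholog` when only predictions with an assigned D. melanogaster ortholog are of interest.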
GFF-formatted data (#2-4 above) are in GFFv3-style format. Each enhancer prediction is assigned an ID based on its rank in the combined results. These files follow the GFFv3 specification with columns as follows:
- SeqID (Chromosome or Scaffold)
- Source (will equal SCRMshaw for enhancer predictions)
- Type (will equal cis-regulatory_region for enhancer predictions)
- Start (1-based coordinates)
- End
- Score: maximum SCRMshaw output score for scored intervals beneath the peak
- Strand: empty (“.”) for enhancer predictions
- Phase: empty (“.”) for enhancer predictions
- Attributes: key=value format with the following attribute pairs:
  - ID: values are based on the rank of the SCRMshaw prediction in the form “scrm_n” where “n” is the rank
  - Amplitude: maximum amplitude of called SCRMshaw peak from MACS2 analysis
  - TrainingSet
  - Method (hexmcd, imm, pac)
  - Rank
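The GFF layout described above can be read with a short Python sketch such as the one below. It assumes that attribute pairs are separated by semicolons, as in the GFFv3 specification, and that the attribute keys are spelled as listed above; the function names are hypothetical. URL-encoded characters in attribute values are not decoded here.

```python
def parse_gff_attributes(attr_field):
    """Split a GFFv3 attributes column (key=value;key=value) into a dict."""
    attributes = {}
    for pair in attr_field.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes

def parse_prediction(line):
    """Return a dict for a SCRMshaw prediction line, or None for any other line."""
    if line.startswith("#") or not line.strip():
        return None  # comment, directive, or blank line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None  # not a well-formed 9-column feature line
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    if source != "SCRMshaw":
        return None  # gene models and other annotation features
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),  # 1-based, as stated above
        "end": int(end),
        "score": float(score),
        "attributes": parse_gff_attributes(attrs),
    }
```

Applied to a “merged” or “sorted” file, `parse_prediction` skips the original annotation features and keeps only the enhancer predictions.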
Sharing/Access information
Data can also be searched and downloaded from the REDfly database (REDfly: Regulatory Element Database for Drosophila (RRID:SCR_006790)) by following the “SCRMshaw” link in the toolbar.
Code/Software
The code used to generate these data can be obtained from the Halfon Lab GitHub repository at https://github.com/HalfonLab/Asma_etal_2024_eLife.
Merging overlapping and/or duplicate predictions can be achieved using the following BEDTools (Quinlan & Hall, 2010, Bioinformatics, PMID 20110278) commands:
bedtools sort -i [SCRMshaw output] | bedtools merge -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 -o max,max,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,min |
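To make the aggregation in the command above easier to follow, here is a pure-Python sketch of the same idea: sort by chromosome and start, group overlapping or adjacent intervals, then take the maximum of columns 4 and 5, the distinct values of columns 6 to 17, and the minimum of column 18. It is an approximation for illustration only and does not call BEDTools. The function names are hypothetical, and details such as the ordering of distinct values and number formatting may differ from actual BEDTools output.

```python
def _distinct(values):
    """Comma-join unique values, keeping first-seen order."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return ",".join(seen)

def _collapse(group):
    """Combine one group of overlapping 18-column rows into a single row."""
    chrom = group[0][0]
    start = min(int(row[1]) for row in group)
    end = max(int(row[2]) for row in group)
    amplitude = max(float(row[3]) for row in group)  # column 4: max
    score = max(float(row[4]) for row in group)      # column 5: max
    middle = [_distinct(row[i] for row in group) for i in range(5, 17)]  # columns 6-17
    rank = min(int(row[17]) for row in group)        # column 18: min
    return [chrom, start, end, amplitude, score, *middle, rank]

def merge_predictions(rows):
    """Merge overlapping or adjacent predictions; rows are lists of 18 strings."""
    rows = sorted(rows, key=lambda row: (row[0], int(row[1]), int(row[2])))
    merged, group, group_end = [], [], None
    for row in rows:
        start, end = int(row[1]), int(row[2])
        if group and row[0] == group[0][0] and start <= group_end:
            group.append(row)
            group_end = max(group_end, end)
        else:
            if group:
                merged.append(_collapse(group))
            group, group_end = [row], end
    if group:
        merged.append(_collapse(group))
    return merged
```

Under this scheme, two overlapping predictions from different training sets collapse into one row that reports both training set names, the higher amplitude and score, and the better (lower) rank.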
Version changes
27-nov-2024: Added a .zip file containing the 48 training sets (positive and negative) used to generate the SCRMshaw predictions in the paper.
Computational analysis of genome sequence and annotation.