Dryad

Data from: Regulatory genome annotation for 33 insect species

Cite this dataset

Halfon, Marc S. et al. (2024). Data from: Regulatory genome annotation for 33 insect species [Dataset]. Dryad. https://doi.org/10.5061/dryad.3j9kd51t0

Abstract

Annotation of newly-sequenced genomes frequently includes genes, but rarely covers important non-coding genomic features such as the cis-regulatory modules (e.g., enhancers and silencers) that regulate gene expression. Here, we begin to remedy this situation by developing a workflow for rapid initial annotation of insect regulatory sequences, and provide a searchable database resource with enhancer predictions for 33 genomes. Using our previously-developed SCRMshaw computational enhancer prediction method, we predict over 2.8 million regulatory sequences along with the tissues where they are expected to be active, in a set of insect species ranging over 360 million years of evolution. Extensive analysis and validation of the data provides several lines of evidence suggesting that we achieve a high true-positive rate for enhancer prediction. One, we show that our predictions target specific loci, rather than random genomic locations. Two, we predict enhancers in orthologous loci across a diverged set of species to a significantly higher degree than random expectation would allow. Three, we demonstrate that our predictions are highly enriched for regions of accessible chromatin. Four, we achieve a validation rate in excess of 70% using in vivo reporter gene assays. As we continue to annotate both new tissues and new species, our regulatory annotation resource will provide a rich source of data for the research community and will have utility for both small-scale (single gene, single species) and large-scale (many genes, many species) studies of gene regulation. In particular, the ability to search for functionally-related regulatory elements in orthologous loci should greatly facilitate studies of enhancer evolution even among distantly related species.

README: Regulatory genome annotation of 33 insect species

https://doi.org/10.5061/dryad.3j9kd51t0

These data are the results of regulatory sequence prediction on 33 insect genomes, produced using the SCRMshaw pipeline as described in the associated publication. Four sets of files are provided:

(1) The post-processed SCRMshaw output for each genome. These files all begin with "scrmshawOutput" or "SO_scrmshawOutput"; files beginning with "SO" have undergone the orthology-assignment step. This output has not been further processed to merge overlapping predictions or to merge duplicate predictions generated using different training data. These files have the extension ".bed" and are tab-delimited text files that can be opened using any standard text editor.

(2) The prediction data from each genome, with overlapping and/or duplicate predictions reconciled as described in the protocol and converted to GFF format. These files begin with a species designation followed by the descriptor "converted." These files have the extension ".gff" and are tab-delimited text files that can be opened using any standard text editor.

(3) The "converted" files from above concatenated to the relevant species annotation GFF file (the same file used during SCRMshaw prediction). These files begin with a species designation followed by the descriptor "merged." These files have the extension ".gff" and are tab-delimited text files that can be opened using any standard text editor.

(4) The "merged" files from above sorted using gff3_toolkit "sort" (Chen et al. 2019, Methods Mol Biol., PMID:30414112). These files begin with a species designation followed by the descriptor "sorted." These files are not present for all species, and may be missing lines for certain other species, because gff3_toolkit works only with files that strictly adhere to the GFFv3 specification. The species-specific GFF files we worked with do not always meet this criterion and thus fail in whole or in part during the sorting process. Going forward we intend to work primarily with genomes and annotations obtained from RefSeq, which should alleviate this issue. These files have the extension ".gff" and are tab-delimited text files that can be opened using any standard text editor.
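As an informal illustration of the four file sets, the naming conventions above could be used to sort a download into categories. This is only a sketch: the function name `classify_file`, the category labels, and the assumption that the descriptor appears as a plain substring of the file name are ours, not part of the dataset or the SCRMshaw pipeline.

```python
def classify_file(name: str) -> str:
    """Guess which of the four file sets a file belongs to, by name only.

    Assumption: the descriptors "converted", "merged", and "sorted" appear
    verbatim somewhere in the GFF file name after the species designation.
    """
    if name.startswith("SO_scrmshawOutput"):
        # Post-processed output that has been through orthology assignment.
        return "scrmshaw_output_with_orthology"
    if name.startswith("scrmshawOutput"):
        return "scrmshaw_output"
    if name.endswith(".gff"):
        for descriptor in ("converted", "merged", "sorted"):
            if descriptor in name:
                return descriptor
    return "unknown"


# Hypothetical file names, used only to exercise the function.
for example in ("SO_scrmshawOutput_example.bed", "Xxxx_converted.gff"):
    print(example, "->", classify_file(example))
```

Checking the actual file names in the download before relying on such a rule is advisable, since the exact separator between species designation and descriptor is not specified here.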

Description of the data and file structure

Additional details about SCRMshaw scores, peak calling, local rank, training sets, and methods can be found in the associated publication.

Post-processed SCRMshaw output (#1, above) is in the form of an 18-column BED-type format organized as follows:

  1. Chromosome
  2. Start coordinate
  3. End coordinate
  4. Peak amplitude: maximum amplitude of called SCRMshaw peak from MACS2 analysis
  5. SCRMshaw score: maximum SCRMshaw output score for scored intervals beneath the peak
  6. Flanking gene
  7. D. melanogaster ortholog of flanking gene (if the orthology step has been run)
  8. Distance of hit from flanking gene (basepairs)
  9. Location of hit relative to flanking gene: e.g., upstream, downstream, inside (intronic)
  10. Local rank: rank of peak relative to other called peaks for the given training set within 50 kb to each side
  11. Next closest flanking gene
  12. D. melanogaster ortholog of next flanking gene (if the orthology step has been run)
  13. Distance of hit from next closest flanking gene (basepairs)
  14. Location of hit relative to next closest flanking gene: e.g., upstream, downstream, inside (intronic)
  15. Local rank: rank of peak relative to other called peaks for the given training set within 50 kb to each side
  16. Training set: name of training data file used to generate these predictions
  17. Method (hexmcd, imm, pac)
  18. Rank

If the orthologous gene is not known, it is listed as “No_OrthoPara.” Where predictions are merged, multiple results may be provided in each column, depending on the results of the merge (e.g., for method, “imm, hexmcd”). Peak amplitude, score, and rank will contain the best value from among the merged predictions. “Local rank” is described in [4], although its utility as a metric when using the SCRMshaw_HD post-processing procedure has not been determined.
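The 18-column layout can be read with a few lines of Python. The following is a minimal sketch, not code from the SCRMshaw pipeline: the field names in `FIELDS` are our own labels for the columns listed above, the sample line is invented for illustration, and merged values are assumed to be comma-separated.

```python
# Illustrative labels for the 18 columns; these names are assumptions,
# not identifiers used by SCRMshaw itself.
FIELDS = [
    "chrom", "start", "end", "amplitude", "score",
    "gene", "gene_ortholog", "gene_distance", "gene_location",
    "gene_local_rank",
    "next_gene", "next_gene_ortholog", "next_gene_distance",
    "next_gene_location", "next_gene_local_rank",
    "training_set", "method", "rank",
]


def parse_prediction(line: str) -> dict:
    """Split one tab-delimited line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # Coordinates are single integers; other columns may hold several
    # values after merging, so they are kept as strings.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def has_known_ortholog(record: dict) -> bool:
    """True if at least one listed ortholog is not the placeholder."""
    values = [v.strip() for v in record["gene_ortholog"].split(",")]
    return any(v != "No_OrthoPara" for v in values)


# Invented example line (not real data).
sample = "\t".join([
    "scaffold_1", "1000", "1750", "30.2", "12.5",
    "geneA", "No_OrthoPara", "2500", "upstream", "1",
    "geneB", "orthB", "8000", "downstream", "2",
    "example_set", "imm", "42",
])
record = parse_prediction(sample)
print(record["chrom"], record["start"], record["end"], has_known_ortholog(record))
```

Leaving most columns as strings is deliberate: after merging, a column such as method may hold more than one value.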

GFF-formatted data (#2-4 above) are in GFFv3-style format. Each enhancer prediction is assigned an ID based on its rank in the combined results. These files follow the GFFv3 specification with columns as follows:

  1. SeqID (Chromosome or Scaffold)
  2. Source (will equal SCRMshaw for enhancer predictions)
  3. Type (will equal cis-regulatory_region for enhancer predictions)
  4. Start (1-based coordinates)
  5. End
  6. Score: maximum SCRMshaw output score for scored intervals beneath the peak
  7. Strand: empty (".") for enhancer predictions
  8. Phase: empty (".") for enhancer predictions
  9. Attributes: key=value format with the following attribute pairs:
    1. ID. Values are based on the rank of the SCRMshaw prediction in the form "scrm_n" where "n" is the rank.
    2. Amplitude: maximum amplitude of called SCRMshaw peak from MACS2 analysis
    3. TrainingSet
    4. Method: (hexmcd, imm, pac)
    5. Rank
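To pull only the enhancer predictions out of a "merged" or "sorted" file, one can filter on the Source column and unpack the attributes. The sketch below assumes the standard GFFv3 ";" separator between attribute pairs; the function names and the sample lines are ours and purely illustrative.

```python
def parse_attributes(column9: str) -> dict:
    """Turn 'ID=scrm_1;Amplitude=30.2;...' into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs


def iter_predictions(lines):
    """Yield one dict per SCRMshaw prediction, skipping all other features."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment, directive, or blank line
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[1] != "SCRMshaw":
            continue  # malformed line or a feature from the gene annotation
        yield {
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),  # 1-based, per the column description
            "end": int(cols[4]),
            "score": float(cols[5]),
            "attributes": parse_attributes(cols[8]),
        }


# Invented example lines (not real data).
sample = [
    "##gff-version 3\n",
    "scaffold_1\tSCRMshaw\tcis-regulatory_region\t1001\t1750\t12.5\t.\t.\t"
    "ID=scrm_1;Amplitude=30.2;TrainingSet=example_set;Method=imm;Rank=1\n",
    "scaffold_1\tOtherSource\tgene\t500\t900\t.\t+\t.\tID=gene1\n",
]
for prediction in iter_predictions(sample):
    print(prediction["seqid"], prediction["attributes"]["ID"])
```

For real use, pass an open file handle instead of the `sample` list, since the function only needs an iterable of lines.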

Sharing/Access information

Data can also be searched and downloaded from the REDfly database (REDfly: Regulatory Element Database for Drosophila (RRID:SCR_006790)) by following the "SCRMshaw" link in the toolbar.

Code/Software

The code used to generate these data can be obtained from the Halfon Lab GitHub repository at https://github.com/HalfonLab/Asma_etal_2024_eLife

Merging overlapping and/or duplicate predictions can be achieved using the following BEDTools (Quinlan & Hall, 2010, Bioinformatics, PMID 20110278) command:

bedtools sort -i [SCRMshaw output] | bedtools merge -c 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 -o max,max,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,distinct,min
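To make the column operations in this command concrete, the following Python sketch mimics what the merge is expected to do on a small in-memory list of rows: overlapping or directly adjacent intervals on the same sequence are combined, columns 4 and 5 keep the maximum, columns 6 to 17 keep the distinct values, and column 18 keeps the minimum. It is a simplified stand-in for illustration, not a replacement for BEDTools; the function name `merge_predictions` is ours, and the ordering of the distinct values may differ from what BEDTools reports.

```python
def merge_predictions(rows):
    """Merge overlapping 18-column rows (lists of strings).

    Returns rows of chrom, start, end, max amplitude, max score,
    distinct values for columns 6-17, and min rank.
    """
    groups = []
    # Equivalent of the sort step: order by sequence name, then start.
    for row in sorted(rows, key=lambda r: (r[0], int(r[1]))):
        chrom, start, end = row[0], int(row[1]), int(row[2])
        last = groups[-1] if groups else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps (or directly abuts) the current group: extend it.
            last["end"] = max(last["end"], end)
            last["rows"].append(row)
        else:
            groups.append({"chrom": chrom, "start": start, "end": end, "rows": [row]})

    merged = []
    for g in groups:
        members = g["rows"]
        amplitude = max(float(r[3]) for r in members)   # -o max on column 4
        score = max(float(r[4]) for r in members)       # -o max on column 5
        # -o distinct on columns 6-17; order of first appearance is kept
        # here, which is an assumption of this sketch.
        distinct = [
            ",".join(dict.fromkeys(r[i] for r in members)) for i in range(5, 17)
        ]
        rank = min(float(r[17]) for r in members)       # -o min on column 18
        merged.append([g["chrom"], g["start"], g["end"], amplitude, score, *distinct, rank])
    return merged
```

Taking the maximum amplitude and score and the minimum rank corresponds to the statement above that these columns contain the best value from among the merged predictions.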

Methods

Computational analysis of genome sequence and annotation.

Funding

United States Department of Agriculture, Award: 2019-67013-29354, NIFA

National Science Foundation, Award: IOS 1557936, IOS/BIO

National Institute of General Medical Sciences, Award: U24 GM14235