Supergene evolution via gain of autoregulation

VanKuren, Nicholas; Shiekh, Sofia; Fu, Claire; Massardo, Darli; Lu, Wei; Kronforst, Marcus

Published Mar 22, 2025 on Dryad. https://doi.org/10.5061/dryad.tx95x6b75

Data files

Mar 22, 2025 version files (2.01 GB total)

atac.tgz: 1.07 GB
cutnrun.tgz: 187.07 MB
genomes.tgz: 183.84 MB
README.md: 29.07 KB
rnaseq.tgz: 575.99 MB

Abstract

Development requires the coordinated action of many genes across space and time, yet numerous species are able to develop multiple discrete, alternate phenotypes. Such polymorphisms are often controlled by supergenes, sets of tightly-linked loci that function together to control development of a polymorphic phenotype. Although theories of supergene evolution are well established, the functional genetic differences between supergene alleles have been difficult to identify. The doublesex supergene controls mimicry polymorphism in several Papilio swallowtail butterflies, where divergent dsx alleles switch between discrete mimetic or non-mimetic female wing patterns. Here we show that the Papilio alphenor supergene evolved via recruitment of five new cis-regulatory elements (CREs) that control allele-specific dsx expression. Most dsx CREs, including three of the five new CREs, are bound by the DSX transcription factor itself. Additionally, DSX differentially binds a small number of unlinked CREs between mimetic and non-mimetic wings, suggesting that the supergene directly regulates expression of unlinked modifier genes that execute the mimetic development program. We thus identify the functional genetic elements of a supergene; propose that autoregulation provides a simple route to supergene origination and the evolution of dominance; and establish a molecular mechanism for epistasis between supergenes and their modifiers.

Abstract

This repository contains data files and R projects required to recapitulate and further explore the data presented in this paper. It includes four tarballs, each supporting a different set of analyses for the paper: genome sequences and annotations for mimetic and non-mimetic P. alphenor (genomes.tgz); data and an R project for analyzing ATAC-seq data (atac.tgz); data and an R project for analyzing DSX and H3K4me3 CUT&RUN data (cutnrun.tgz); and data and an R project for re-analyzing RNA-seq data generated in VanKuren et al. (2023), updated to use these latest genomes (rnaseq.tgz). All analyses performed for the paper can be re-run using these data and projects. We also provide several intermediate or summary files from which you can extract relevant information immediately. (These files can be re-made with the supplied data and scripts, but they allow you to skip some long analyses if desired.)

The supplied R projects and notebooks are meant to be comprehensive and to guide you through running and analyzing the data. The knitted notebooks are supplied but can be re-made with the R projects. Each file is described below, but your first stop should be the knitted html files, which contain many details on the input data, filtering steps, code to run the key QC and analyses, intermediate results, and objects that will allow you to further explore the data.

File Access

Here is a summary of the utilities that one can use to access each file type contained within this repository.

Tarballs can be unpacked using GNU tools tar and gzip. For example, tar -xzf genomes.tgz will yield a directory genomes/ containing the files listed below.
Text files (.txt, .fa, .bed, .gff3, .sh, .R) can be accessed using GNU utilities such as less, more, and cat, or with your favorite text editor (e.g. Notepad, Emacs).
Comma- or tab-separated values files (.csv, .tsv) can be accessed using the same utilities used to access text files. Additionally, Google Sheets or Microsoft Excel can be used.
Image files (.png) can be opened with your favorite image viewer, such as Preview or Image Browser. They may also be opened in a web browser.
HTML format files (.html) can be opened using your favorite web browser, such as Google Chrome or Safari.
R project files (.Rproj) can be opened in R, or (preferably) RStudio. They contain information on a set of analyses. Similarly, R scripts (.R) and R notebook files (.Rmd) can be opened in R or, ideally, RStudio. These are the core commands and notes required to reproduce analyses.
Shell scripts (.sh) can be executed on a UNIX command line using, e.g. bash TrimMapReads.sh. However, check each script before use because most are set up for running batch jobs on a SLURM-managed computing cluster.

I will note files specific to particular analyses or directories where necessary.
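
As an alternative to the tar command above, the tarballs can also be inspected and unpacked with Python's standard library. This is only a rough sketch: the function names are invented for illustration, and the extraction filter is applied only when the running Python version provides it.

```python
import tarfile
from pathlib import Path


def list_members(tarball):
    """Return the member names stored in a gzip-compressed tarball."""
    with tarfile.open(tarball, mode="r:gz") as tar:
        return tar.getnames()


def unpack(tarball, dest="."):
    """Extract a gzip-compressed tarball into dest and return dest as a Path."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, mode="r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can refuse unsafe members (e.g. absolute paths).
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return dest


# Example (not run here): unpack("genomes.tgz") should yield a genomes/ directory.
```

Calling list_members first is a cheap way to check what a tarball contains before committing disk space to the larger archives.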

File Inventory: genomes.tgz

This directory contains a total of 6 files.

papilio_alphenor.mimetic.chr.blacklist.bed: BED format file containing blacklisted regions that were excluded from CUT&RUN and ATAC peak calls. See section 0C of the atac notebook for details on its production.
papilio_alphenor.mimetic.chr.fa: Chromosome-level genome assembly for a homozygous mimetic H/H female Papilio alphenor. Can also be indexed and accessed using samtools.
papilio_alphenor.mimetic.curated.chr.gff3: Generic Feature Format v3 (GFF3) annotations for the mimetic assembly.
papilio_alphenor.nonmimetic.chr.blacklist.bed: BED format file containing blacklisted regions that were excluded from CUT&RUN and ATAC peak calls. See section 0C of the atac notebook for details on its production.
papilio_alphenor.nonmimetic.chr.fa: Chromosome-level genome assembly for a homozygous nonmimetic h/h female Papilio alphenor, and associated files. Can also be indexed and accessed using samtools.
papilio_alphenor.nonmimetic.curated.chr.gff3: Generic Feature Format v3 (GFF3) annotations for the nonmimetic assembly. These were lifted over from a high-quality annotation of the mimetic assembly.
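
For a quick look at the genome files without samtools or R, the sketch below sums blacklisted bases per chromosome from a BED file and reports sequence lengths from a FASTA file. It uses only the Python standard library; the function names are illustrative and are not part of the supplied scripts.

```python
from collections import defaultdict


def blacklisted_bases(bed_lines):
    """Sum interval lengths per chromosome from BED lines.

    BED coordinates are 0-based and half-open, so length = end - start.
    Overlapping intervals are not merged, so any overlap is counted twice.
    """
    totals = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional header/comment lines BED allows.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        totals[chrom] += int(end) - int(start)
    return dict(totals)


def fasta_lengths(fasta_lines):
    """Return {sequence name: length} from FASTA lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths
```

Both functions accept any iterable of lines, so an open file handle works, e.g. `with open("genomes/papilio_alphenor.mimetic.chr.blacklist.bed") as fh: blacklisted_bases(fh)`.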

File Inventory: cutnrun.tgz

This directory contains a total of 2413 files (5 files and 4 sub-directories at the top).

cutnrun_2024.Rproj: R Project containing information and platform for running the chunks in the Rmd file.
CutnrunAnalysis.2024.html: Knitted version of the Rmd file. This is an html notebook that can be re-created by running the Rmd file, but can also just be browsed in your browser.
CutnrunAnalysis.2024.Rmd: R Markdown file containing the R Notebook used to run the associated analyses, including methods, notes, instructions, and code that can be used to access the Papilio alphenor CUT&RUN data.
DsxDbpAnalysisResults.2024-12-02.csv: Complete DSX DiffBind analysis results used for the paper. Based on the mimetic genome assembly. These can be recreated using the Rmd file, but are provided here for convenience. Column descriptions:
- chr, start, end, width: Chromosome location and peak width
- mean, conc1, conc2, fold, p.value: Normalized read count and statistical information from DiffBind. See the notebook and the DiffBind and DESeq2 documentation for more detail.
- comparison: Samples being compared (sample1.v.sample2)
- gfdr: Globally-corrected false discovery rate.
- peak_name, group, stages: Unique DSX peak name, groups in which it was present, and stages in which it was present. See notebook for more details.
NonmimDsxDbpAnalysisResults.2024-12-02.csv: Complete DSX DiffBind analysis results based on the non-mimetic genome assembly. These can be recreated using the Rmd file, but are provided here for convenience. The full set of commands can be found in scripts/NonmimDbpAnalysis.R. Column descriptions are the same as the previous file.
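
The column descriptions above are enough to pull significant peaks out of either results file without re-running DiffBind. The following is a minimal sketch using only the Python standard library (the notebooks themselves use R). The function names and the 0.05 threshold are illustrative assumptions, not values taken from the paper.

```python
import csv


def read_csv_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def significant_peaks(rows, max_gfdr=0.05, comparison=None):
    """Keep rows whose gfdr is at or below max_gfdr.

    rows: iterable of dicts keyed by the column names described above.
    comparison: optional label (sample1.v.sample2) to restrict to one contrast.
    max_gfdr: assumed cutoff for illustration only.
    """
    kept = []
    for row in rows:
        if comparison is not None and row["comparison"] != comparison:
            continue
        try:
            gfdr = float(row["gfdr"])
        except (TypeError, ValueError):
            continue  # skip missing values such as "NA"
        if gfdr <= max_gfdr:
            kept.append(row)
    return kept


# Example (not run here):
# rows = read_csv_rows("DsxDbpAnalysisResults.2024-12-02.csv")
# hits = significant_peaks(rows)
```

Because the two results files share the same columns, the same functions should apply to the non-mimetic file as well.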
data/: Directory containing much of the raw data used to make plots in the notebook, along with some analysis results.
- AllSignificantGenes.20231019.csv: Results from RNA-seq analyses found in the rnaseq.tgz directory. Copied here to maintain independence between tarballs. Column descriptions:
  - gene: Gene identifier
  - masigpro: Qualitative mark for whether the gene was significantly differentially expressed in maSigPro analyses (>0) or not (0)
  - deseq2: Qualitative mark for whether the gene was significantly differentially expressed in any DESeq2 analysis (1) or not (0).
  - deseq2_gfdr.*: Globally corrected false discovery rate in the DESeq2 analysis of the indicated stage (l5, p0, p2, p4, p8)
- h3k4me3.diffbind_sample_sheet.csv, dsx_diffbind_sample_sheet.csv: Sample sheets used to run DiffBind. Column descriptions (see also DiffBind manual):
  - SampleID: Sample identifier
  - Factor: Group identifier
  - Condition: Sample/replicate identifier
  - Genotype: Sample genotype (mimetic or non-mimetic)
  - Sex: Male (m) or female (f)
  - Replicate: Biological replicate
  - Stage: Developmental stage (p2 or p5)
  - bamReads: Relative location of the aligned reads files. These do not exist in this repository - you must re-make them if you desire to re-run this analysis.
  - Peaks: Relative location of each sample’s DSX/H3K4me3 CUT&RUN peaks file (from MACS3)
  - PeakCaller: Format of the peaks files. See documentation.
- raw_fastqs.multiqc_report.html, final_fastqs.multiqc_report.html: FastQC results from analyzing raw CUT&RUN FASTQ files and processed (adapters trimmed; see scripts/TrimMapReads.sh) FASTQ files.
- initial_read_data.csv: Mapping statistics. Column descriptions:
  - Sample: Sample identifier
  - PF Clusters (M): Sequencer spots passing filters (millions)
  - Filtered Pairs (M): Filtered pairs of reads (millions)
  - Properly Paired (M): Properly paired read pairs (millions)
  - Pct Properly Paired: Percent of total filtered pairs properly paired
  - Duplicate Pairs (M): Duplicate read pairs (millions)
  - Pct Duplicates: Percent of total filtered pairs duplicates
  - Valid Pairs (M): Total number of valid mapped reads (millions)
- PeakCallingStats.2023-03-16.csv: Peak calling stats from initial tests on peak calling approaches. Used for plotting in the notebook. Column descriptions:
  - id: Sample identifier
  - sr_*: SEACR (relaxed)
    - peaks: Number of peaks genome-wide
    - frip: Fraction of reads in peaks
    - mb: Megabases within peaks
  - ss_*: SEACR (stringent)
  - m2n_*: MACS2 (narrow)
  - m2b_*: MACS2 (broad)
- SamplesSequencing.csv: Samples, sequencing indexes, and IDs. Column descriptions:
  - Short ID: Sample identifier
  - i5 Primer: i5 primer name
  - i5 barcode: i5 barcode sequence
  - i7 primer: i7 primer name
  - i7 barcode: i7 barcode sequence
  - Pool: Sequencing pool ID
  - Sequencing ID: Internal sequencing ID
- papilio_alphenor.mimetic.chr.blacklist.bed: BED format file containing blacklisted regions from the mimetic genome, used to mask peak calls.
- rnaseq_data.Rdata: R data object containing RNA-seq analysis objects and results used for creating intersections. This is duplicated from the rnaseq.tgz directory to ensure each tarball is independent.
- mimetic_dsx_peakset.homer/, mimetic_h3k4me3_peakset.homer/, nonmimetic_dsx_peakset.homer/: Results from HOMER analyses described in the notebook for different merged peaksets. DSX (dsx) and H3K4me3 (h3k4me3) peaks based on the mimetic or nonmimetic genomes. See the HOMER documentation for a deeper explanation of each file. Important for you: .motif files can be opened with a standard text editor, and .svg files can be opened with a web browser or a vector image program such as Adobe Illustrator or Inkscape. (Directories contain 735, 510, and 740 files, respectively.)
- peak_calls/: A directory containing CUT&RUN peak calls using different methods. These files are used and described in the notebook. (Directory contains a total of 390 files.) Critical files are:
  - merge/mimetic_dsx_peakset.full.2024-12-02.txt: This is the final DSX peak callset against the mimetic reference. Peak naming and critical analyses were based on this peakset. Columns are: chromosome, start, end, peak name, groups in which peak is present, and stages in which peak is present.
  - merge/mimetic_h3k4me3_peakset.full.2024-12-02.txt: This is the final H3K4me3 peak callset against the mimetic reference. Columns are: chromosome, start, end, peak name, groups in which peak is present, and stages in which peak is present.
  - merge/nonmim_calls/macs3_to_nonmim/merged/nonmimetic_dsx_peakset.full.2024-12-02.txt: This is the final DSX peak callset against the non-mimetic reference. Peak naming and critical analyses were based on this peakset. Columns are: chromosome, start, end, peak name, groups in which peak is present, and stages in which peak is present.
  - merge/nonmim_calls/macs3_to_nonmim/merged/nonmimetic_h3k4me3_peakset.full.2024-12-02.txt: This is the final H3K4me3 peak callset against the non-mimetic reference. Columns are: chromosome, start, end, peak name, groups in which peak is present, and stages in which peak is present.
  - Additional files are peak callsets generated using the program indicated by the directory name. These are fully described in the associated R Notebook. The .summary and .narrowPeak files are simple text files and can be opened using the commands and programs noted above.
- diffbind/: Directory containing R data objects with computationally expensive but important intermediate results. (Directory contains a total of 6 files.)
  - dsx_dba_objects.Rdata: Objects (peak read counts, etc.) used in and resulting from DSX DiffBind analyses against the mimetic reference.
  - h3k4me3_dba_objects.Rdata: Objects (peak read counts, etc.) used in and resulting from H3K4me3 DiffBind analyses against the mimetic reference.
  - nonmim_dsx_dba_objects.Rdata: Objects (peak read counts, etc.) used in and resulting from DSX DiffBind analyses against the non-mimetic reference.
  - dsx_dba_internal_objects.Rdata: Objects (peak read counts, etc.) used in and resulting from DSX DiffBind analyses against the mimetic reference but using the built-in peak merging approach of DiffBind. Just used for testing.
  - dsx_profiles.Rdata, h3k4me3_profiles.Rdata: Computationally expensive objects containing read pileups around dsx and h3k4me3 differentially bound peaks. Used in the notebook.
  - dsx_peak_annotations.Rdata: Objects with merged information from HOMER peak annotations and DSX peak calls and DBA.
info/: (Directory contains a total of 2 files.)
- CUT&RUN Protocol 8-2022.docx: Complete experimental protocol for CUT&RUN. Can be opened in Google Docs or Microsoft Word.
- DsxPeakOrthologScores.2023-12-19.csv: Critical file. Contains DSX binding site motif information for each peak, plus the log-odds score for that motif. Used to generate Fig 3. Calculated using HOMER, then joined in R. Column descriptions:
  - group: Group membership for each peak (mimetic, nonmimetic, or mimetic-specific)
  - peak: Arbitrary peak number
  - score: DSX motif score (log-odds against the consensus motif)
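
Given the three columns described above, a per-group summary of DSX motif scores takes only a few lines. This is a sketch with the Python standard library, not the code used to generate the figure; the function names are illustrative.

```python
import csv
import statistics
from collections import defaultdict


def load_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def score_summary(rows):
    """Return {group: (n, mean, median)} of DSX motif scores."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(float(row["score"]))
    return {
        group: (len(scores), statistics.mean(scores), statistics.median(scores))
        for group, scores in by_group.items()
    }


# Example (not run here):
# summary = score_summary(load_rows("info/DsxPeakOrthologScores.2023-12-19.csv"))
```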
plots/: Directory containing IGV snapshots used to generate the notebook. (Directory contains a total of 4 files.)
scripts/: Directory containing R and shell scripts that run key steps in the analysis pipeline. (Directory contains a total of 7 files.)
- TrimMapReads.sh: bash script that will run the first steps in trimming and mapping raw reads.
- BedtoolsMerge3Peaksets.sh: bash script to merge biological replicate peak calls using bedtools.
- GetFrip.sh: bash script to calculate fraction of reads in peaks (frip) given a bed file of regions of interest (e.g. peaks)
- ConvertHomerAnnotatePeaksToCsv.sh: bash script to convert the output of HOMER’s annotatePeaks.pl to comma-separated format.
- JoinAllTables.2023-12-30.R: R script to join differential expression, differential binding, and ATAC data into a single comprehensive table.
- CustomCnrFunctions.R: R script containing functions used to compile the Rmd notebook.
- NonmimDbpAnalysis.R: R script to run DiffBind using the non-mimetic DSX peakset and non-mimetic samples.
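
The final merged peaksets under data/peak_calls/merge/ are described above as six-column text files (chromosome, start, end, peak name, groups, stages). The sketch below parses such a file and finds peaks overlapping a region. It assumes tab-separated fields, no header row by default, and BED-like half-open coordinates; none of these is confirmed by this README, so check a file before relying on it. All names are illustrative.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    groups: str
    stages: str


def parse_peakset(lines, sep="\t", has_header=False):
    """Parse six-column peakset lines into Peak records.

    sep and has_header are assumptions about the file layout.
    """
    peaks = []
    for i, line in enumerate(lines):
        line = line.rstrip("\r\n")
        if not line or (has_header and i == 0):
            continue
        chrom, start, end, name, groups, stages = line.split(sep)[:6]
        peaks.append(Peak(chrom, int(start), int(end), name, groups, stages))
    return peaks


def peaks_in_region(peaks, chrom, start, end):
    """Return peaks on chrom overlapping the half-open interval [start, end)."""
    return [p for p in peaks if p.chrom == chrom and p.start < end and p.end > start]


# Example (not run here):
# with open("data/peak_calls/merge/mimetic_dsx_peakset.full.2024-12-02.txt") as fh:
#     peaks = parse_peakset(fh)
```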

File Inventory: atac.tgz

Directory contains a total of 497 files.

atac_notebook.html: HTML notebook containing details on the analyses, code to reproduce key results and plots, and information underlying the manuscript. Can be recreated by executing the associated .Rmd file, but is provided here for convenience.
atac_notebook.Rmd: R Markdown notebook containing details on the analyses, code to reproduce key results and plots, and information underlying the manuscript. Can be Knitted to produce the html format file, which is also provided for convenience.
papilio_alphenor_atac_2023-11.Rproj: R Project file for opening in RStudio. Useful for interacting with the Rmd file and knitting.
data/: (Directory contains a total of 303 files.)
- AllSignificantGenes.20231019.csv: Results from RNA-seq analyses found in the rnaseq.tgz directory. Copied here to maintain independence between tarballs. Contains information on differential expression analyses, significance, and rough annotations used to cross-reference CUT&RUN peaks with differentially expressed genes. Column descriptions:
  - gene: Gene identifier
  - masigpro: Qualitative mark for whether the gene was significantly differentially expressed in maSigPro analyses (>0) or not (0)
  - deseq2: Qualitative mark for whether the gene was significantly differentially expressed in any DESeq2 analysis (1) or not (0).
  - deseq2_gfdr.*: Globally corrected false discovery rate in the DESeq2 analysis of the indicated stage (l5, p0, p2, p4, p8)
- DapAnalysisResults.Rdata: Rdata file containing R objects with differential accessibility (DA) results from pairwise analyses using DiffBind. Details can be found in the notebook.
- dsx_peak_annotations.Rdata: Rdata file containing objects with information on the locations of DSX CUT&RUN peaks relative to genes. These were created in the cutnrun directory’s notebook. More information can be found in that project.
- DsxDbpAnalysisResults.2024-11-19.csv: Final results from differential binding analyses of DSX binding. Copied from the CUT&RUN directory; see that directory for more information. These are pairwise comparisons of DSX CUT&RUN signal using DiffBind. Column descriptions:
  - chr, start, end, width: Chromosome location and peak width
  - mean, conc1, conc2, fold, p.value: Normalized read count and statistical information from DiffBind. See the notebook and the DiffBind and DESeq2 documentation for more detail.
  - comparison: Samples being compared (sample1.v.sample2)
  - gfdr: Globally-corrected false discovery rate.
  - peak_name, group, stages: Unique DSX peak name, groups in which it was present, and stages in which it was present. See notebook for more details.
- diffbind/: (Directory contains a total of 4 files.)
  - atac.diffbind_sample_sheet.cluster.csv: DiffBind sample sheet for use on a SLURM computing cluster. Column descriptions (see also DiffBind manual):
    - SampleID: Sample identifier
    - Factor: Group identifier
    - Condition: Sample/replicate identifier
    - Genotype: Sample genotype (mimetic or non-mimetic)
    - Sex: Male (m) or female (f)
    - Replicate: Biological replicate
    - Stage: Developmental stage (p2 or p5)
    - bamReads: Relative location of the aligned reads files. These do not exist in this repository - you must re-make them if you desire to re-run this analysis.
    - Peaks: Relative location of each sample’s DSX/H3K4me3 CUT&RUN peaks file (from MACS3)
    - PeakCaller: Format of the peaks files. See documentation.
  - atac.diffbind_sample_sheet.csv: DiffBind sample sheet for use on a personal computer to run DiffBind analyses. Note that you’d need BAM files to properly run this. Column descriptors are the same as the previous file.
  - atac_dba_analyzed.Rdata: Rdata file containing the analyzed DiffBind data objects. Can be loaded into R and used to access the final DiffBind analysis results used for the manuscript.
  - atac_dba_object.Rdata: Rdata file containing the initial DiffBind data objects used for downstream analyses. Can be loaded in the notebook and used to perform DiffBind analyses.
- multiqc/: Directory containing FastQC results in HTML format for each sample before (raw_multiqc_report.html) and after (final_multiqc_report.html) quality control. (Directory contains a total of 2 files.)
- peak_calls/: Directory containing initial, filtered, and merged ATAC peak calls from Fseq2 for each genome relative to the appropriate reference genome (mimetic or non-mimetic). The .summary and .narrowPeak files are text files that can be opened using the methods described above. Details on how the files were generated are contained within the .Rmd notebook. (Directory contains a total of 293 files.)
  - all_peaks/: Directories containing raw peaks for each sample. frip/ directories contain information on the fraction of reads in peaks for each sample relative to its own peakset.
  - filtered_peaks/: Directories containing filtered peaks for each sample. frip/ directories contain information on the fraction of reads in peaks for each sample relative to its own peakset.
  - merged/: Directories containing the final, merged peaksets used for all downstream analyses. These are the important ones! See the notebook for more information.
info/: (Directory contains a total of 3 files.)
- ATAC-seq v2.docx: Experimental ATAC-seq protocol used to generate data. Can be opened with Microsoft Word or Google Docs.
- AtacSampleInfo.csv: CSV file containing sample IDs, group, stage, and tissue information. Used to compile information in the notebook. Column descriptions:
  - seqid: Sequencing ID
  - sample: Sample ID
  - sex: Sex, male (m) or female (f)
  - genotype: Mimetic (m, H/H) or non-mimetic (n, h/h)
  - stage: Developmental stage
  - cross: Cross from which these individuals were derived
  - i7_adapter: i7 adapter name
  - i7_index: i7 index sequence
  - i5_adapter: i5 adapter name
  - i5_index: i5 index sequence
  - cycles: Number of PCR cycles to form final library
- AtacSampleStats.csv: Statistics on mapping, duplication rates, etc. Used to produce plots in the notebook. Column descriptions:
  - sample: Sample ID
  - M_pairs: millions of read pairs
  - frac_dup: Fraction of read pairs that are PCR duplicates
  - frac_chrM: Fraction of read pairs mapped to the mitochondrion
  - frac_prop_paired: Fraction of read pairs properly paired in mapping
  - nrf: Non-redundant fraction of pairs
  - pbc1, pbc2: PCR bottleneck coefficients 1 and 2
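
To make the last three columns concrete, the sketch below computes the non-redundant fraction and the two PCR bottleneck coefficients from a list of mapped read-pair positions. It assumes the commonly used ENCODE-style definitions (NRF = distinct positions / total pairs; PBC1 = positions seen once / distinct positions; PBC2 = positions seen once / positions seen twice). This README does not state which definitions were used, so treat the formulas as an assumption.

```python
from collections import Counter


def library_complexity(positions):
    """Compute NRF, PBC1, and PBC2 from an iterable of read-pair positions.

    positions: hashable keys, e.g. (chrom, start, end, strand) per mapped pair.
    Definitions follow the ENCODE-style convention (an assumption here).
    """
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no positions supplied")
    distinct = len(counts)
    m1 = sum(1 for c in counts.values() if c == 1)  # positions seen exactly once
    m2 = sum(1 for c in counts.values() if c == 2)  # positions seen exactly twice
    return {
        "nrf": distinct / total,
        "pbc1": m1 / distinct,
        "pbc2": m1 / m2 if m2 else float("inf"),
    }
```

Under these definitions, values closer to 1 for nrf and pbc1 indicate a more complex library with fewer PCR duplicates.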
results/: (Directory contains a total of 172 files.)
- DapAnalysisResults.2025-02-07.csv: Differential accessibility (DA) results from DiffBind, calculated before posting. This file contains information on DA from relevant pairwise comparisons calculated using DiffBind. It can be recreated using the notebook. Column descriptions:
  - chr, start, end, width: Chromosome location and peak width
  - mean, conc1, conc2, fold, p.value: Normalized read count and statistical information from DiffBind. See the notebook and the DiffBind and DESeq2 documentation for more detail.
  - comparison: Samples being compared (sample1.v.sample2)
  - gfdr: Globally-corrected false discovery rate.
  - peak_name: Unique DSX peak name
  - region: Genome region in which the peak falls
  - gene: Gene with which the peak is associated (according to HOMER annotations)
- H_allele_peaks.homer/: HOMER results for all H allele ATAC peaks. See the HOMER documentation for more details. The .motif files are just text files. The .svg files can be visualized with a web browser or a vector image software such as Adobe Illustrator or Inkscape. (Directory contains a total of 95 files.)
- H_allele_peaks.mset_insects.homer/: HOMER results for all H allele ATAC peaks, restricted to binding sites known from insects. See the HOMER documentation for more details. The .motif files are just text files. The .svg files can be visualized with a web browser or a vector image software such as Adobe Illustrator or Inkscape. (Directory contains a total of 76 files.)
scripts/: (Directory contains a total of 8 files.)
- CustomAtacFunctions.R: R script containing custom functions for plotting and analysis in the notebook.
- ConvertHomerPeaksToCsv.sh: Shell script to convert HOMER output to CSV format.
- atac/
  - ATACseqQC.R: R script for running ATACseqQC.
  - GetFrip.sh: shell script for calculating fraction of reads in peaks (FRiP) for a given set of genomic intervals and appropriate BAM files.
  - RunFseq2.sh: shell script to run Fseq2 to identify peaks in ATAC-seq data.
  - BedtoolsMerge3Peaksets.fseq2.sh: shell script to merge biological replicate peak sets calculated using Fseq2.
  - RunATACseqQC.sh: shell script that wraps ATACseqQC.R for parallel execution on a SLURM compute cluster.
  - TrimMapReads.atac.sh: shell script that will pre-process and map raw ATAC-seq reads to a given reference genome.
plots/: Directory containing IGV snapshots for use in creating the notebook. See the notebook for details. (Directory contains a total of 6 files.)
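
One way to combine the tables in this directory is to ask which differentially accessible peaks are annotated to differentially expressed genes. The sketch below joins rows from results/DapAnalysisResults.2025-02-07.csv with rows from data/AllSignificantGenes.20231019.csv on their gene columns. It assumes that gene identifiers match between the two files and uses an illustrative 0.05 cutoff; the supplied JoinAllTables script in the cutnrun project is the authoritative approach, and this is not a reimplementation of it.

```python
def _to_float(value):
    """Convert a CSV field to float, returning None for blanks or 'NA'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def da_peaks_at_de_genes(peak_rows, gene_rows, max_gfdr=0.05):
    """Return differentially accessible peaks annotated to DESeq2-significant genes.

    peak_rows: dicts with at least 'gene' and 'gfdr' keys.
    gene_rows: dicts with at least 'gene' and 'deseq2' keys.
    max_gfdr: assumed cutoff for illustration only.
    """
    de_genes = {g["gene"] for g in gene_rows if _to_float(g.get("deseq2")) == 1}
    hits = []
    for row in peak_rows:
        gfdr = _to_float(row.get("gfdr"))
        if gfdr is not None and gfdr <= max_gfdr and row.get("gene") in de_genes:
            hits.append(row)
    return hits
```

Rows can be loaded with csv.DictReader from the standard library, which yields dicts keyed by the header names.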

File Inventory: rnaseq.tgz

Directory contains a total of 1407 files.

updated_genome_RNAseq_analysis.Rmd: R markdown document containing the code and documentation needed to recapitulate the results and to enable further exploration of the data. Can be opened in RStudio and executed chunk by chunk. The knitted version is found in the associated html file if one does not wish to re-knit the whole thing.
updated_genome_RNAseq_analysis.html: HTML document of the latest version of the Rmd file. This can be opened in a browser to view a detailed description of the analyses carried out for the ms and some key intermediate results.
updated_genome_RNAseq_analysis_2023.Rproj: R project for loading into RStudio. Used as a wrapper around the Rmd notebook.
data/: (Directory contains a total of 1387 files.)
- alphenor.sample_info.csv: CSV file containing sample information for each RNAseq sample. This is used as the data table for creating a DESeq2 object in the R notebook. Column descriptions:
  - sample: Sample identifier
  - genotype: Mimetic (H/H, m) or non-mimetic (h/h, n)
  - sex: Female (f) or male (m)
  - stage: Developmental stage
  - time: Linear time variable
  - family: Family from which this sample was derived
- papilio_alphenor.mimetic.curated.chr.tx2gene.sep_dsx.txt: Text file containing transcript-to-gene mapping information. Used by tximport to collapse transcript-level quantifications to gene-level quants.
- mimetic.k23: Directory containing the salmon v1.9.0 quantification results for each sample. This directory can be read by tximport to read in transcript- or gene-level quantifications for each sample. Each sample’s results are contained within its own folder, and the files are comprehensively described in the salmon documentation. JSON-formatted (.json) files are database files that can be read using standard text editors. (Directory contains a total of 1380 files.)
- rdata/: (Directory contains a total of 4 files.)
  - all_data_final.Rdata: Rdata file containing the final, filtered gene-level quantification data for each sample, excluding outliers. Can be loaded into R.
  - all_data_initial.Rdata: Rdata file containing the initial gene-level quantification results from reading in the salmon data for each sample using tximport. Can be loaded into R and further explored.
  - deseq2_results.Rdata: Rdata file containing objects holding the results from the stage-specific DESeq2 analysis used in the manuscript. These are the raw output from the DESeq2 analyses and can be used to further explore those results. Can be loaded into R.
  - masigpro_objects.Rdata: Rdata file containing objects holding the results from the maSigPro analysis used in the manuscript. These are the raw output from the maSigPro steps and can be used to further explore those results and statistics associated with each gene. Can be loaded into R.
- outliers.txt: Text file containing the IDs of outlier samples excluded from the publication analyses.
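
For a quick look at one sample's salmon output outside R, the sketch below reads a quant.sf file (salmon's tab-separated table with Name, Length, EffectiveLength, TPM, and NumReads columns) and sums TPM per gene using a transcript-to-gene dictionary. It is far simpler than what tximport does and is meant only for browsing. The per-sample path in the example and the way the tx2gene mapping is built are assumptions.

```python
import csv
from collections import defaultdict


def read_quant_sf(path):
    """Read a salmon quant.sf file into {transcript name: TPM}."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row["TPM"]) for row in reader}


def gene_level_tpm(transcript_tpm, tx2gene):
    """Sum transcript TPM per gene; transcripts missing from tx2gene are skipped."""
    totals = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene = tx2gene.get(transcript)
        if gene is not None:
            totals[gene] += tpm
    return dict(totals)


# Example (not run here; the sample folder name is hypothetical):
# tpm = read_quant_sf("data/mimetic.k23/SAMPLE_ID/quant.sf")
```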
plots/: Directory containing intermediate plots used for QC and analysis of the RNAseq data. These are included in the knitted notebook and can be re-made by knitting the notebook fully. (Directory contains a total of 14 files.)
results/: (Directory contains a total of 1 file.)
- AllSignificantGenes.2025-02-07.csv: A CSV file containing the latest results from knitting the notebook: maSigPro and DESeq2 results for each gene included in the analysis. It can be used to quickly search for genes of interest and can be re-made by knitting the notebook. Column descriptions:
  - gene: Gene identifier
  - masigpro: Qualitative mark for whether the gene was significantly differentially expressed in maSigPro analyses (>0) or not (0)
  - deseq2: Qualitative mark for whether the gene was significantly differentially expressed in any DESeq2 analysis (1) or not (0).
  - deseq2_gfdr.*: Globally corrected false discovery rate in the DESeq2 analysis of the indicated stage (l5, p0, p2, p4, p8)
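
Using the masigpro and deseq2 columns as described above (significant when masigpro is greater than 0 and when deseq2 equals 1), the agreement between the two methods can be tallied as follows. This is a standard-library sketch; the function name is illustrative.

```python
from collections import Counter


def tally_calls(rows):
    """Count genes by (significant in maSigPro, significant in DESeq2).

    rows: dicts with 'masigpro' and 'deseq2' keys, e.g. from csv.DictReader.
    Returns a Counter keyed by (bool, bool) tuples.
    """
    tally = Counter()
    for row in rows:
        try:
            masigpro = float(row["masigpro"]) > 0
            deseq2 = float(row["deseq2"]) == 1
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric marks
        tally[(masigpro, deseq2)] += 1
    return tally
```

For example, `tally_calls(rows)[(True, True)]` gives the number of genes called significant by both analyses.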
scripts/: (Directory contains a total of 3 files.)
- CustomRnaseqFunctions.R: R script containing custom plotting and analysis functions used in the notebook.
- ModifiedMasigproFunctions.R: R script containing functions that I modified from the maSigPro package. Used for analyses in the notebook.
- RNAseqPlottingFunctions.R: R script containing functions for plotting RNAseq data, such as gene expression profiles.

Raw Data Access

You can download all of the raw data from NCBI SRA through the following BioProjects:

RNAseq: BioProject PRJNA882073
Functional genomics data: BioProject PRJNA1062051

Sample IDs in those projects and in the sample sheets provided in this repository should match.
