Genetic, developmental, and neural changes underlie evolving butterfly mate preference
Data files
Sep 29, 2023 version files 360.60 MB
-
heliconius_cydno_alithea.yellow.decorated.gff3.gz
-
heliconius_cydno_alithea.yellow.fa.gz
-
heliconius_cydno_RNAseq.2023.tgz
-
README.md
Oct 22, 2024 version files 3.05 GB
-
gwas.tar.gz
-
README.md
-
rnaseq.tar.gz
Abstract
Many studies have linked genetic variation to behavior, but the links between that variation and the neural circuits that drive behavior remain elusive. We investigated the architecture of mate choice behavior in Heliconius butterflies, which use vision to identify preferred mates based on wing color patterns. We found that Heliconius cydno mate preference is associated with inter-photoreceptor inhibition of ultraviolet-sensitive photoreceptors (PRs) by long-wavelength sensitive PRs; identified a small number of genetic loci associated with preference variation; and began to link these multiple layers of behavior variation together through analyses of developmental gene networks. Our results support the idea that altered peripheral neural computations, driven by changes to underlying developmental genetic processes, can significantly and rapidly alter essential behaviors.
This repository contains raw data, scripts, notebooks, and intermediate files that that were used to perform and analyze genome-wide association studies and to analyze RNA-seq data generated by and presented in the associated publication.
README: Genetic, developmental, and neural changes underlie evolving butterfly mate preference
Abstract
This Dryad repository contains raw data, scripts, and notebooks that can recapitulate the key analyses performed in the associated publication and allow further exploration the results. This repository contains three main sections:
- Data, scripts, and a notebook detailing the genome-wide association analyses for forewing color and male mate choice. Data includes:
- Raw forewing color and courtship behavior data from Chamberlain et al. (2009).
- Genome-wide variant calls using whole-genome re-sequencing data from 113 H. cydno alithea males.
- Raw and intermediate analysis results for color and choice genome-wide association tests
- Intermediate results for analyses of linkage disequilibrium and $F_{ST}$ statistics
- The H. c. alithea (yellow) reference genome sequence and gene annotation generated and used in this paper
- Data, scripts, and a notebook detailing the analysis of RNA-sequencing data for H. c. alithea and H. c. galanthus central brain, optic lobe, and retina at seven developmental stages. Data includes:
- Intermediate gene expression data
- Full results from differential expression analyses presented the paper
Versions
- 2024-10-18 Update:
- Updated title to reflect order of paper results.
- Updated Methods section of this repository to add information of how data required for analyses described in the
gwas
directory were generated. - Minor updates to initial
rnaseq
submission, including: minor text edits innotebook.Rmd
for clarity; the addition of an updated filedata/GenesInOtherPrefPeaks.50kb.2024-10-10.txt
containing a list of genes found near genome-wide association (GWA) peaks on chromosomes 7, 9, and 11. - Relocated top-level genome and genome annotation files into
gwas/info/genome
directory. - Added a new
gwas
directory containing data, scripts, and an R notebook required for running the GWA analyses presented in the associated publication, plus additional QC steps that were not presented in the publication but are important for understanding our final analyses. This README has been updated to reflect this addition and describes each of those files below. Thegwas/notebook.Rmd
R notebook contains additional information and will knit properly after installation of the required R packages.
Description of the data and file structure
Description of data and files within the gwas
directory:
The most important files in this directory are the notebook.Rmd and the gwas_notebook.Rproj files. These two files can together be used to re-run analyses presented in the paper, contain much of the detail and reasoning behind each analysis, and link all of the data and scripts described below into a single cohesive story. Please start there.
gwas.tar.gz
- gwas_notebook.Rproj: R project file for loading into R and RStudio. In conjunction with
notebook.Rmd
, can be used to re-run analyses presented in the associated paper and to run new analyses of interest to end user. - notebook.Rmd: R markdown file containing detailed notes, code chunks, and additional data processing steps that can be used to recapitulate the paper results and allow exploration of the results further.
- expected_notebook.html: output from knitting
notebook.Rmd
before dryad submission. End users may simply browse this notebook or knit the project fresh. - cluster_scripts/
- PreprocessReads.sh: a shell script for SLURM that can be used to perform quality control and mapping of whole-genome re-sequencing data for the samples found in BioProject PRJNA802836
- RunHaplotypeCaller.sh: a shell script SLURM that can be used to perform the first step of calling genome-wide variation in the alignment files generated by PreprocessReads.sh. This script depends on the GenomeAnalysisToolKit v4.2 and the HaplotypeCaller module.
- RunGenotypeGvcfs.sh: a shell script for SLURM that can be used to perform the second step of calling genome-wide variation using the output from RunHaplotypeCaller.sh. This script depends on the GenomeAnalysisToolKit v4.2.
- RunGemmaFwcGwa.sh: a shell script for SLURM that can be used to perform the genome-wide association for male forewing color (fwc) using the variant callset found in
data/plink/heliconius_cydno_alithea_yellow.variants.20230712.pruned...
- RunGemmaLmpGwa.sh: a shell script for SLURM that can be used to perform the genome-wide association for male lifetime mate preference (lmp) using the variant callset found in
data/plink/heliconius_cydno_alithea_yellow.variants.20230712.pruned...
- RunGmmatChoiceWald.R: an R script that will perform the genome-wide association for choice using GMMAT 1.4.2, the variant callset in
data/plink/heliconius_cydno_alithea_yellow.variants.20230712.pruned...
and the raw phenoytpe and choice data indata/phenotype_data/MalePhenotypeData.20190306.txt
anddata/phenotype_data/CourtsForR.20190506.EXPANDED.csv
. - RunGmmatChoiceWald.sh: a shell script for SLURM that wraps RunGmmatChoiceWald.R. - RunGmmatFwcWald.R: an R script that will perform the genome-wide association for male forewing color (fwc) using GMMAT 1.4.2, the variant callset in
data/plink/heliconius_cydno_alithea_yellow.variants.20230712.pruned...
and the raw phenoytpe data indata/phenotype_data/MalePhenotypeData.20190306.txt
. - RunGmmatFwcWald.sh: a shell script for SLURM that wraps
RunGmmatFwcWald.R
.
- current_cache/: an empty directory to contain cached results from knitting the R notebook
./notebook.Rmd
- data/
- fst/
- heliconius_cydno_alithea_yellow.variants.20230712.pruned.10k2k.windowed.weir.fst.gz: gzipped text file containing $F_{ST}$ calculations in 10 kb sliding windows (2 kb step) between yellow and white males. Calculated using vcftools 0.16 and the variant calls in
data/plink/heliconius_cydno_alithea_yellow.variants.20230712.pruned...
- gwas/
- choice.grm_only.20240923.out.txt.gz: gzipped text file containing the genome-wide association results for male mate choice including only the genetic relatedness matrix (grm) as a random effect.
- choice.grm_stock.20240925.out.txt.gz: gzipped text file containing the genome-wide association results for male mate choice including the genetic relatedness matrix (grm) and principal component 2 (pc2, or stock) as random effects.
- forewing_color.cXX.txt: the genetic relatedness matrix calculated using GEMMA 0.98.5 and the variant callset in
data/plink/heliconius_cydno_alithea_yellow.variants.20230712.pruned...
- forewing_color.grm_only.assoc.txt.gz: gzipped text file containing the genome-wide association results for male forewing color including only the genetic relatedness matrix as a random effect.
- forewing_color.grm_pc2.assoc.txt.gz: gzipped text file containing the genome-wide association results for male forewing color including the genetic relatedness matrix and principal component 2 (pc2, or stock) as random effects.
- forewing_color.grm_pcs1-3.assoc.txt.gz: gzipped text file containing the genome-wide association results for male forewing color including the genetic relatedness matrix (grm) and principal components 1-3 as random effects.
- klocus.court_num.20240926.out.txt.gz: gzipped text file containing the association results for male mate choice in the K locus (chr1:14.5Mb-16.5Mb) including the genetic relatedness matrix (grm) a random effect and court number as an additional fixed effect.
- klocus.male_fwc.20240926.out.txt.gz: gzipped text file containing the association results for male mate choice in the K locus (chr1:14.5Mb-16.5Mb) including the genetic relatedness matrix (grm) a random effect and male forewing color as an additional fixed effect.
- preference_all.grm_only.assoc.txt.gz: gzipped text file containing the genome-wide association results for male lifetime preference (estimated using all courts) and including only the genetic relatedness matrix as a random effect.
- ld/
- K_locus_LD.lt10kb.ld.gz: gzipped text file containing pairwise linkage disequilibrium (LD) estimates for all pairs of K locus variants less than 10 kb apart.
- LdEstimates.100Mb.240905.txt: text file containing summaries of LD values for 600 million random pairs of variants from across the genome.
- snp_ld_values: text file containing pairwise LD estimates for pairs of top forewing color and male mate choice variants identified using genome-wide association.
- plink/
- heliconius_cydno_alithea_yellow.variants.20230712.pruned.[bed|bim|fam|nosex]: plink bfile set containing filtered variant calls for all 113 males in the final dataset. This was the callset used for all genome-wide association analyses.
- heliconius_cydno_alithea_yellow.variants.20230712.pruned.eigenv[ec|al]: text files containing principal components (PC) analysis results for the first 20 PCs. Calculated using PLINK 1.90 and the plink bfile set in this directory.
- heliconius_cydno_alithea_yellow.variants.20230712.pruned.forewing_color.fam: plink fam file containing the male forewing color information.
- susie/
- SusieData.20241003.Rdata: R data file containing two objects used for the SuSiE analysis described in the publication and the notebook.
- info/
- genome/
- heliconius_cydno_alithea.yellow.fa.gz: bgzipped FASTA format file containing the yellow Heliconius cydno alithea genome sequence.
- heliconius_cydno_alithea.yellow.agp: text file containing information linking
heliconius_cydno_alithea.yellow.fa
scaffolds to the Heliconius melpomene v2.5 genome assembly. Generated using RagTag and subsequent massaging withawk
. - heliconius_cydno_alithea.yellow.genome: text file containing
heliconius_cydno_alithea.yellow.fa
scaffold names and lengths. - heliconius_cydno_alithea.yellow.decorated.modified.gff3.gz: gzipped generic feature format v3 (GFF3) file containing gene annotation information for the
heliconius_cydno_alithea.yellow.fa
genome sequence. - phenotype_data/
- CourtsForR.20190506.csv: csv file containing raw courtship data from Chamberlain et al. (2009)
- CourtsForR.20190506.EXPANDED.csv: csv file containing raw courtship data from Chamberlain et al. (2009) EXPANDED to include a single courtship event per row.
- GenotypePhenotypeCourtshipDataForR.20190620.txt: tab-delimited file containing genotype and courtship data for males studied by Chamberlain et al. (2009)
- MalePhenotypeData.20190306.txt: tab-delimited file containing male wing color pattern information from Chamberlain et al. (2009)
- sequencing_qc/
- Raw-Alithea-Data_multiqc_report[.html|_data.zip]: html file and associated data summarizing the quality of the raw whole-genome re-sequencing data using FastQC and MultiQC. Data can be downloaded from NCBI BioProject PRJNA802836.
- Trimmed-Alithea-Data_multiqc_report[.html|_data.zip]: html file and associated data summarizing the quality of the processed whole-genome re-sequencing data using FastQC and MultiQC. Data can be downloaded from NCBI BioProject PRJNA802836 and processed using
cluster_scripts/PreprocesReads.sh
. - SequencingQcAndMappingForR.20190620.txt: text file containing summary data for re-sequenced samples. - plots/: empty directory for holding plots generated by knitting the
notebook.Rmd\
file in RStudio.
- scripts/
- RunGmmatScore.R: R script to run the genome-wide assocation analysis for male mate choice using GMMAT and the score test.
- functions_for_gwa.R: R script containing useful functions for loading data, plotting, and massaging data.
Description of data and files within the rnaseq
directory:
The most important files in this directory are the notebook.Rmd and the rnaseq.Rproj files. These two files can together be used to re-run analyses presented in the paper, contain much of the detail and reasoning behind each analysis, and link all of the data and scripts described below into a single cohesive story. Please start there.
rnaseq.tar.gz
- rnaseq.Rproj: R project file for loading into RStudio
- notebook.Rmd: R markdown notebook containing detailed notes, pipelines, and code chunks for recapitulating the published results and to allow further independent exploration by the end user.
- data/
- AllGeneFunctionalAnnotations.2023-04-04.csv.gz:
- GenesInKlocus.2023-05-23.txt: A text file containing a list of genes within the K locus (Hcay201001o:14500000-16500000)
- GenesInOtherPrefPeaks.50kb.2024-10-10.txt: A text file containing a list of genes within 50 kb of the top GWA peaks on chromosomes 7, 9, and 11.
- heliconius_cydno_alithea_deseq_results.Rdata: R data file containing the results from all stage-specific differential expression analyses used in the publication. This is a list of tibbles,
deseq_results
, where each tibble contains the full results from a single stage-specific run of DESeq2. This can be used to analyze stage-specific DE results for any gene. - helicoinius_cydno_alithea.yellow.eggnog.csv.gz: A gzipped CSV file containing the results from running eggNOG's
emapper
utility on the full yellow Heliconius cydno alithea protein annotation set. This contains information of putative orthologs and functions for each gene. - heliconius_cydno_data.Rdata: An Rdata file containing four objects used to run the publication analyses.
tx2gene
: a 2-column tibble containing transcript-to-gene mappingsample_info
: a tibble containing RNA-seq sample informationinitial_dds
: a DESeq2 data object containing unfiltered gene quantification data for each sample insample_info
. Values were loaded from raw salmon output for each sample usingtximport
package.filtered_dds
: a DESeq2 data object containing filtered gene quantification data for the final set of 260 samples included in the publication analyses. This is theinitial_dds
object filtered for genes with mean normalized expression values >50 and for non-outlier samples.- heliconius_cydno_DEGs.ym_v_wm.gFDR_0.01.Rdata: An Rdata file containing a list of vectors of DE genes,
DEGs_list
, found in each relevant DESeq2 and maSigPro comparison. This isdeseq_results
filtered to keep only genes with global FDR <= 0.01 for each comparison, plus genes found to be DE bymaSigPro
. See the notebook for details on how these vectors were generated. - heliconius_cydno_masigpro_results.Rdata: An Rdata file containing the results from
maSigPro
runs from each tissue. This file contains list objects for each tissue (central brain, optic lobe, and retina) with the rawmaSigPro
results. You can extract significant genes usingmaSigPro::get.siggenes()
. - heliconius_cydno_wgcna_objects.Rdata: An Rdata file containing the objects required for running and generated by WGCNA with the
filtered_dds
andsample_info
information. This file contains: datExpr
: expression data, generated fromDESeq2::counts(filtered_dds, norm = T)
datTraits
: trait data, derived fromsample_info
dynamicMEs
: gene co-expression module eigenvectors (summaries of gene expression for all genes within that module), for dynamically pruned modules.dynamicModuleColors
: vector containing module colors for each gene included indatExpr
. These are the module assignments given to each gene by running WGCNA.dynamicModuleLabels
: vector containing module labels (numbers) for each gene included indatExpr
. These are the module assignments given to each gene by running WGCNA.gene_tree
: hierarchical clustering object (fromhclust
) describing the relationships of all genes indatExpr
. You can plot usingbase::plot()
.METree
: hierarchical clustering object (fromhclust
) describing the relationships of all co-expressed gene modules, based on the eigenvectors contained indynamicMEs
. You can visualize usingbase::plot
- plots/publication_plots/rPCA outlier identification/: directory containing the rPCA outlier plots that we relied on for the publication's filtering and analysis steps. These may differ from the analogous plots generated by the notebook because of random seeding. Files are labeled
9-OutlierId.all_{stage}_{tissue}.outliers.png
. Where stage and tissue are descrbed in the notebook. - results/module_GOs/
- go_enrichment/: directory containing
topGO
gene ontology (GO) enrichment results in CSV format for each co-expressed gene module identified in the publication analyses. Module colors match those in thedynamicModuleColors
vector. - go_figure/: directory containing plots generated using
GO-figure!
, the results from WGCNA, and the gene annotations held inheliconius_cydno_alithea.yellow.eggnog.csv.gz
. The results for each invocation are found in independent directories. Thetopgo_files
directory contains the properly formatted input files, derived from the eggNOG annotations. - graphs/: empty directory to contain output from
topGO
runs produced during knitting.
- go_enrichment/: directory containing
- scripts/:
- heliconius_RNAseq_functions.R: R script file containing custom functions that I used to produce plots, massage data, etc.
- modified_masigpro_functions.R: R script file containing
maSigPro
functions that have been modified to produce more useable/aesthetically pleasing output. These supersede functions in the nativemaSigPro
package when knitting this notebook.
Sharing/Access information
Raw whole genome sequencing data can be downloaded from the NCBI Short Read Archive through BioProject PRJNA802836.
Raw RNA-sequencing data can be downloaded from the NCBI SRA through BioProject PRJNA1019262.
Code/Software
R notebooks
All R work was performed in RStudio 2024.04.1+748 and R version 4.4.
R notebooks contained in the gwas
and rnaseq
directories contain all information needed to run and should install required packages upon first execution. The expected_notebook.html
files contain the expected results from simply opening the notebooks in RStudio and knitting using knitr
. Packages and version numbers are contained in those html
files via sessionInfo()
.
All Rcode contained within the gwas
and rnaseq
directories should be self-contained and able to run on standard system setups. NWV executed R notebooks on a 2015 MacBook Pro with 16Gb RAM and 2.5GHz i7 processor running Monterey 12.7.5. W Lu executed on a Windows machine with 16Gb RAM, 2.4GHz i7 processor and running Windows 10.
R notebooks contain additional commands that we used to run standalone programs on a compute cluster or on the UNIX command line. These are contained within their own chunks within those notebooks. These chunks were used to produce many of the intermediate results files found in data
, and the input and output to those commands is written relative to the respective directory.
SLURM scripts
It is likely that users will need to edit the SLURM directives in any of the cluster_scripts
scripts for their particular setups, in particular the account and the module loading steps.
Additional programs
Methods
Animals
We used butterflies from four different taxa. The H. c. alithea used in the preference and color GWA analyses were previously tested for courtship behavior in Ecuador in 2008 by Nicola Chamberlain and Durrell Kapan. These butterflies were tested for their preference, then the bodies stored in 100% ethanol at -80oC until genomic DNA extractions (see below) [1]. For all other experiments, butterflies were housed in greenhouse breeding colonies at the University of Chicago that were regularly supplemented with new individuals. Adults were fed Bird’s Choice artificial nectar ad libitum and supplied with blooming Lantana as an additional source of nectar and pollen. Heliconius cydno galanthus and H. melpomene pupae were obtained from El Bosque Nuevo in Costa Rica, and H. c. alithea from Heliconius Butterfly Works in Ecuador. Heliconius pachinus and F1 H. c. galanthus X H. pachinus hybrids were bred in Panama and adults were transported to Chicago for experiments. Collection, rearing, import and export were done under permits from Ecuador, Panama, Costa Rica, and USA.
Heliconius cydno alithea (yellow) genome assembly and annotation
We isolated DNA from thorax of a single adult yellow H. cydno alithea female using the QIAGEN Genomic-tip 20/G following the manufacturer’s instructions with the following modifications: tissue was incubated at 50oC shaking at 800 rpm overnight in lysis buffer. We used 4 ug of this high molecular weight DNA as input to Oxford Nanopore Technologies (ONT) ligation library preparation kit SQK-LSK 110. We prepared libraries following the manufacturer’s instructions with modifications based on the protocol found here: https://www.protocols.io/view/dna-extraction-and-nanopore-library-prep-from-15-3-bp2l6n3kzgqe/v1. This protocol differs from the manufacturer’s protocols in the following ways. End-repair was performed at 20oC for 1 hour, dA-tailing was performed for 30 minutes, ligation was performed for 1 hour at room temperature, and all bead elution steps were allowed to proceed for one hour at room temperature. We also used PacBio SRE XS kit to remove <10kb fragments in the final libraries.
Final libraries were sequenced on an ONT MinION with version 9.4.1 flow cell. We performed basecalling using Guppy and the super accurate basecalling model in dna_r9.4.1_450bps_sup.cfg supplied with the basecaller. For the genome assembly, we adopted a similar strategy as (Steward et al. 2021). We generated the initial draft assembly using Flye 2.9 [2] with estimated genome size of 282Mb (based on GenomeScope estimate https://github.com/schatzlab/genomescope) and Shasta 0.10.0 (https://github.com/chanzuckerberg/shasta) with default parameters. The Flye assembly and Shasta assembly were polished with two rounds of racon 1.5.0 (https://github.com/isovic/racon) and one round of medaka 1.8.1 with ONT reads, and then purged to remove duplicate scaffolds (typically uncollapsed allelic variation) using purge_dups (https://github.com/dfguan/purge_dups). Finally, the duplicate scaffolds were merged together with quickmerge (https://github.com/mahulchak/quickmerge) and purged using purge_dups.
To simplify comparisons across species, we scaffolded H. c. alithea contigs to the Heliconius melpomene v2.5 chromosome-level assembly using RagTag [3] and renamed H. c. alithea scaffolds to match. Finally, we identified and soft-masked repeat sequences using RepeatModeler and RepeatMasker [4,5]. The genome sequence and annotation used in this study can be found in this repository z8w9ghxjz in the `gwas/data/genome/` directory. The final genome assembly comprised 310 scaffolds spanning 294 Mb, with 287 Mb assigned to H. melpomene chromosomes. BUSCO v5 analysis showed the H. c. alithea genome contained 97.7% complete (97.3% single-copy, 0.4% duplicated), 0.4% fragmented, and 1.9% missing OrthoDB v10 Endopteryogota (2,124 single-copy orthologs) SCOs.
We annotated H. c. alithea scaffolds using EvidenceModeler 1.1.1 [6]. We first assembled the H. cydno transcriptome de novo using RNA-seq data generated by Walters et al. [7], Nallu et al. [8], and Rossi et al. [9] using Trinity v2.10.0 [10]. RNA-seq data was also mapped to using STAR 2.6.1d [11], and the resulting alignments used to generate genome-guided assemblies using Trinity and StringTie 1.3.1 [12]. We combined de novo and genome-guided assemblies using PASA [13]. Evidence for protein-coding regions came from mapping the UniProt/Swiss-Prot (2020_06) database and all Papilionoidea proteins available in NCBI’s GenBank nr protein database (downloaded 6/2020) using exonerate [14]. We identified high-quality multi-exon protein-coding PASA transcripts using TransDecoder (transdecoder.github.io), then used these models to train and run Genemark-ET 4 [15] and GlimmerHMM 3.0.4 [16]. We also predicted gene models using Augustus 3.3.2 [17], the supplied heliconius_melpomene1 parameter set, and hints derived from RNA-seq and protein mapping above. Augustus predictions with >90% of their length covered by hints were considered high-quality models. Transcript, protein, and ab initio data were integrated using EVM with the weights in table S8.
Raw EVM models were then updated twice using PASA to add UTRs and identify alternative transcripts. BUSCO v5 analysis of the final annotated protein set showed it contained 94.8% complete, 1.8% fragmented, and 3.4% missing OrthoDB v10 endopteryogta SCOs (n = 2124). Gene models derived from transposable element proteins were identified using BLASTp and removed from the annotation set. Functional annotations were applied to this annotation set using eggNOG mapper v5 [18]. The final annotation comprises 18,763 protein-coding genes and 30,325 transcripts. We identified 1:1 orthologs to Drosophila melanogaster proteins using reciprocal BLASTp, assigning orthologs only to those genes where the top hit was identical between the two directions (i.e. Hca → Dmel AND Dmel → Hca). Gene annotations, eggNOG results, and Drosophila orthologs are supplied in this repository z8w9ghxjz.
Heliconius cydno alithea genome re-sequencing and variant calling
Genomic DNA used for calling variants was isolated from thorax of 113 H. c. alithea males studied by Chamberlain et al. [1] using chloroform extractions. These individuals were assessed for their mate preference in 2008, then stored in 100% ethanol at -80oC until gDNA extractions in 2015 - 2019. We re-sequenced all individuals with multiple courts, plus a number of males with just a single court, that produced high-quality genomic DNA, yielding a subset of 113 males from the original 175 included in the original study. Illumina paired-end libraries were constructed using the KAPA Hyper Prep Kit (KAPA Biosystems) or Nextera Library Prep Kit and sequenced to ~15X using 2x100 bp on an Illumina HiSeq2500 or 4000 at the University of Chicago Functional Genomics Facility. Raw data can be found in NCBI BioProject PRJNA802836.
Low-quality regions and adapters were trimmed from raw reads using Trimmomatic before mapping to the H. c. alithea reference using bowtie2 v2.3.2 with default settings except --very-sensitive-local [19]. We then marked PCR duplicate reads with Picard and realigned around putative indels using the Genome Analysis Toolkit (GATK) 4.2 [20,21]. SNP and indel calling was performed using the HaplotypeCaller and GenotypeGVCFs module in GATK 4.3.0 with the heterozygosity priors set to 0.01 for both SNPs and indels. Scripts and variant calls in PLINK bed/bim/fam format can be found in this repository in `gwas/data/plink`.
RNA-sequencing data
We aimed to collect RNA-sequencing data from retina, optic lobe, and brain tissue at seven developmental stages in H. c. galanthus, white H. c. alithea, and yellow H. c. alithea males and females in triplicate. We used controlled crosses between H. c. alithea males and females that were homozygous for the top wing color variant, thus ensuring that larvae and pupae from each cross would (if they were allowed to emerge) develop a single wing color. We identified appropriate adults for crosses by clipping a single leg from each individual that emerged from each shipment, extracting DNA from that leg using DNA ExtractALL reagents (Thermo), then performing a custom TaqMan genotyping assay for the wing color variant using the leg DNA. Only males and females that were homozygous for the yellow or white allele were used to set up “yellow” or “white” crosses. All H. c. galanthus individuals were used in H. c. galanthus crosses. We set up crosses between multiple males and females in the UChicago greenhouse and provided ample host plants for egg lay. Caterpillars and pupae were maintained in separate small cages for each cross, and individuals were labeled upon pupation to track developmental timing. We collected tissues from one larval stage (final instar purple crawler, ~36h before pupation), five pupal stages (p0: 12 - 24 hours after pupation, p2: 48 - 60 hap, p4: 96 - 108 hap, p6: 144-156 hap, and p7: 168-180 hap), and one adult stage (ad: 24-48 hours after emergence). Pupal sex was determined using external pupal characteristics (https://www.ucl.ac.uk/taxome/jim/Mim2/heliconius_pupa_sex_difference.html) as well as the presence/absence of testis, which are very prominent in butterflies.
We collected head tissue from purple crawler and p0 pupae because the main neural tissues are small and difficult to separate. We collected retina, optic lobe, and central brain separately for all remaining stages. We dissected individuals in cold PBS and immediately placed dissected tissues into RNAlater (Ambion, USA). Tissues were stored in RNAlater at -80oC until RNA extraction using TRIzol (Ambion, USA). High quality (RIN > 7) RNA samples were treated with Turbo DNAse (Invitrogen, USA), then 1 ug was used as input for poly-A selection and RNA-seq library preparation using the NEBNext Poly(A) mRNA Magnetic Isolation Module and NEBNext UltraII Directional RNA Prep Kit following the manufacturer’s instructions with minor modifications. RNA fragmentation was performed for 10 min at 94oC. We used the NEBNext Multiplex Oligos for Illumina dual-index adapters to uniquely barcode each sample. Double-sided selection was performed after adapter ligation to enrich for ~300 bp - 500 bp fragments. Final libraries were PCR amplified for 11 cycles. RNA-seq libraries were pooled and sequenced 2x100 bp on a NovaSeq 6000 at the University of Chicago Functional Genomics Facility.
All raw RNA-seq data downloaded from NCBI BioProject PRJNA1019262.
Preliminary processing of RNA-seq data
We quantified gene expression in each sample using the raw reads, the yellow H. c. alithea transcriptome, and salmon v1.9.0 [31]. The whole genome sequence was included as the decoy, and sequence composition, GC, and positional bias corrections were used during quantification. Indexes and quantification used k-mer size 31. Quantifications, scripts for analysis, and other data objects can be found in this repository in the `rnaseq/` direcotry.