Draft Papilio alphenor assembly and annotation
Data files
Aug 24, 2023 version files 104.62 MB
-
papilio_polytes_alphenor.fa.gz
79.95 MB
-
papilio_polytes_alphenor.gff3.gz
4.61 MB
-
papilio_polytes_alphenor.proteins.fa.gz
4.80 MB
-
papilio_polytes_alphenor.transcripts.fa.gz
15.26 MB
-
README.md
1.90 KB
Abstract
Novel phenotypes are increasingly recognized to have evolved by co-option of conserved genes into new developmental contexts, yet the process by which co-opted genes modify existing developmental programs remains obscure. Here we provide insight into this process by characterizing the role of co-opted doublesex in butterfly wing color pattern development. dsx is the master regulator of insect sex differentiation but was co-opted to control the switch between discrete non-mimetic and mimetic patterns in Papilio alphenor and its relatives. We found dynamic spatial and temporal expression pattern differences between mimetic and non-mimetic butterflies throughout wing development. A mimetic color pattern program is switched on by a pulse of dsx expression in early pupal development that causes acute and long-term differential gene expression, particularly in Wnt and Hedgehog signaling pathways. RNAi suggested opposing, novel roles for these pathways in mimetic pattern development. Importantly, Dsx co-option caused Engrailed, a key transcription factor target of Hedgehog signaling, to gain a novel expression domain early in pupal wing development that is propagated through mid-pupal development to specify novel mimetic patterns despite becoming decoupled from Dsx expression itself. Altogether, our findings provide multiple views into how co-opted genes can both cause and elicit changes to conserved networks and pathways to result in development of novel, adaptive phenotypes.
This dataset contains the genome assembly and annotation described in the associated manuscript. Sequencing data, including all RNA-seq data, is available under the NCBI under BioProject PRJNA882073.
Methods
Please see our publication for further details.
The published Papilio polytes reference genome (RefSeq GCF_000836215.1) was generated using P. polytes polytes from Japan, while the RNA-seq data generated in this study is from P. alphenor from the Philippines. These two groups diverged ~1.7 million years ago and have ~5.1% nucleotide divergence, resulting in low mapping rates of our alphenor RNA-seq data to the published reference. The alphenor genome assembly in this dataset is based on the P. polytes polytes assembly and a recent chromosome-level P. bianor assembly.
We assembled a draft alphenor genome using PE100 sequencing from 29 individuals (BioProject PRJNA234541, excluding SRR1118138) and Platanus v2.0.2. Before assembly, we trimmed raw reads using TrimGalore!, then removed reads containing over-represented sequences using FastQC and an available python script. We then assembled all processed reads using the default platanus2 pipeline. Next, we assigned raw alphenor scaffolds to the RefSeq polytes assembly using RagTag v1.0.1, then assigned these alphenor pseudoscaffolds and all other alphenor scaffolds that hit insect sequences in the NCBI nr database to P. bianor chromosomes, again using RagTag v1.0.1.
Finally, we assembled alphenor doublesex alleles separately and substituted them into the final chromosome-level assembly. These additional steps were necessary because the individuals used for the full Platanus assembly were a mix of homo- and heterozygotes at the dsx locus – while the majority of the genome is homogeneous among these samples, the dsx alleles are very divergent except for the dsx coding region. We assembled all non-mimetic female samples BioProject PRJNA234541 using Platanus 2 and assigned scaffolds to the polytes assembly using RagTag. We then pulled out the polytes region corresponding to the dsx inversion, defined by Nishikawa et al. (2015) as H_locus_nonmimetic_H_scaffold:1931762-2054949, and used it to replace the corresponding region in the chromosome-level alphenor assembly. Similarly, we assembled the mimetic dsx allele by assembling all mimetic female samples from BioProject PRJNA234541 using Platanus 2, assigning those scaffolds to the RefSeq assembly, and extracting the H_locus_mimetic_hetero scaffold. We added this scaffold as chrH in the final alphenor assembly.
We assembled the alphenor mitochondrial genome using NOVOplasty v4.2 using sequencing data from SRR1108726 and the RefSeq mtDNA assembly for polytes (NC_024742.1) as the seed sequence. This resulted in a single circularized sequence of 15,247 bp.
We annotated the alphenor genome using EvidenceModeler 1.1.1. We first assembled a high-quality transcript database using PASA, our SE50 data, and PE100 and SE50 data from Nallu et al. (2018). After adapter trimming, we performed de novo and genome-guided assembly using Trinity v2.10.0 and genome-guided assembly using StringTie v1.3.3. RNA-seq data was also mapped to alphenor chromosomes using STAR 2.6.1d, and the resulting alignments were used to generate genome-guided assemblies with Trinity and StringTie 1.3.1. We combined de novo and genome-guided assemblies using PASA 2.4.1. Evidence for protein-coding regions came from mapping the UniProt/Swiss-Prot (2020_06) database and all Papilionoidea proteins available in NCBI’s GenBank nr protein database (downloaded 6/2020) using exonerate. We identified high-quality multi-exon protein-coding PASA transcripts using TransDecoder (transdecoder.github.io), then used these models to train and run Genemark-ET 4 and GlimmerHMM 3.0.4. We also predicted gene models using Augustus 3.3.2, the supplied heliconius_melpomene1 parameter set, and hints derived from RNA-seq and protein mapping above. Augustus predictions with >90% of their length covered by hints were considered high-quality models. Transcript, protein, and ab initio data were integrated using EVM.
Raw EVM models were then updated twice using PASA to add UTRs and identify alternative transcripts. Gene models derived from transposable element proteins were identified using BLASTp and removed from the annotation set. The final annotation comprises 17,342 genes encoding 26,991 protein-coding transcripts, containing 95% and missing 3% of endopterygota single-copy orthologs according to BUSCO v3 and OrthoDB v9. We functionally annotated protein models using eggNOG’s emapper-2.0.1b utility and the v2.0 eggNOG database.
Transcript and protein sequences were extracted from this final annotation file using Cufflinks' gffread utility.
Usage notes
Standard Unix utilities such as less, grep, and emacs can be used to visualize these text files. Fasta sequences can be retrieved using samtools' faidx utility. Genome fasta and GFF3 annotation files can be directly loaded into a genome viewer such as IGV.