Supplemental data from: Genomic signatures of speciation in butterflies

Cong, Qian 1 ; Zhang, Jing 1 ; Grishin, Nick1

Published Apr 01, 2026 on Dryad. https://doi.org/10.5061/dryad.2v6wwpzwc

Data files

Apr 01, 2026 version files 5.21 GB

README.md

13.55 KB
reference_genomes.tar.gz

920.82 MB
stage1_integrate_nuclear_data.tar.gz

3.35 GB
stage2_integrate_mito_data.tar.gz

1.09 MB
stage3_species_calibration.tar.gz

102.28 KB
stage4_divergence_hotspots.tar.gz

23.75 MB
stage5_running_BPP.tar.gz

3.15 MB
stage6_study_hybrid_zone.tar.gz

901.68 MB

Abstract

Studies of life rely on classifying organisms into species. Contrary to a frequent belief, simple and quantitative standards for species delineation are lacking, and debates about species delimitation create obstacles for conservation biology, agriculture, legislation, and education. To tackle this key biological question, we have chosen butterflies as model organisms. We sequenced and analyzed transcriptomes of 186 butterfly specimens representing pairs of close but clearly distinct species, conspecific populations, and taxa that are debated among experts. We find that species are robustly separated from conspecific populations by the combination of two measures computed on Z-linked genes: the fixation index that detects hiatus between species, and the extent of gene flow that quantifies reproductive isolation. These criteria suggest that all 9 butterfly pairs that caused experts' disagreement are distinct species, not populations or subspecies. When applied to Homo, our criteria agree that all modern humans are the same species, distinct from Neanderthals, suggesting the relevance of this study beyond butterflies. Furthermore, we found that divergence and positive selection in proteins involved in interaction with DNA (including proteins encoded by trans-regulatory elements), circadian clock, pheromone sensing, development, and immune response recurrently correlate with speciation. A significant fraction of these divergent proteins is encoded by the Z chromosome, which appears to be more resistant to introgression than autosomes. Taken together, we find possible common speciation mechanisms in butterflies, present additional support for an important role of the Z chromosome in speciation of butterflies, and suggest quantitative criteria for butterfly species delimitation using genomic data, which is vital for the exploration of biodiversity.

This repository contains a comprehensive analysis pipeline for studying butterfly speciation and divergence using RNA-seq data. The analysis is organized into 6 main stages, each focusing on different aspects of genomic and transcriptomic analysis.

Overview

This pipeline analyzes nuclear and mitochondrial DNA sequences from multiple butterfly species to study:

Species delineation using fixation indices
Divergence hotspots identification
Bayesian phylogenetics and phylogeography (BPP)
Hybrid zone analysis

Directory Structure

`previous/`

Contains older versions of the analysis results as compressed tar.gz files. These serve as backups of previous iterations of each analysis stage.

Files:

reference_genomes.tar.gz - Previous version of reference genome data
stage1_integrate_nuclear_data.tar.gz - Previous nuclear data integration results
stage2_integrate_mito_data.tar.gz - Previous mitochondrial data integration results
stage3_species_calibration.tar.gz - Previous species calibration results
stage4_divergence_hotspots.tar.gz - Previous divergence hotspot analysis
stage5_running_BPP.tar.gz - Previous BPP analysis results
stage6_study_hybrid_zone.tar.gz - Previous hybrid zone study results

`reference_genomes/`

Contains reference genome sequences and protein sets for multiple butterfly species.

Files:

*_genome.fa - Reference genome FASTA files for various species:
- Burnsius_communis_genome.fa - Burnsius communis genome
- Calephelis_nemesis_genome.fa - Calephelis nemesis genome
- Calycopis_cecrops_genome.fa - Calycopis cecrops genome
- Libytheana_carinenta_genome.fa - Libytheana carinenta genome
- Phoebis_sennae_genome.fa - Phoebis sennae genome
- Pterourus_glaucus_genome.fa - Pterourus glaucus genome
*_protein.fa - Reference protein sequence files:
- Calycopis_cecrops_protein.fa - Calycopis cecrops proteins
- Cecropterus_lyciades_protein.fa - Cecropterus lyciades proteins
- Danaus_plexippus_protein.fa - Danaus plexippus proteins
- Pterourus_glaucus_protein.fa - Pterourus glaucus proteins
Lerema_accius.fa - Lerema accius sequence data

`stage1_integrate_nuclear_data/`

Processes nuclear DNA sequences and creates codon/protein mappings for multiple butterfly species.

Species Data Files (pattern: {SPECIES_CODE}_gene.map and {SPECIES_CODE}_prot.map):
Each species has two main files:

*_gene.map - Gene mapping data (nucleotide-level alignments)
*_prot.map - Protein mapping data (amino acid-level alignments)

Processing Scripts:

step1_get_codon_map.py - Converts nucleotide alignments to codon mappings
step2_get_prot_map.py - Processes protein-level alignments

`stage2_integrate_mito_data/`

Analyzes mitochondrial DNA sequences and barcode regions for species identification and divergence analysis.

Species Data Files (per species):

*_barcode.map - Mitochondrial barcode region mappings
*_mito.fa - Complete mitochondrial genome sequences
*_mito.map - Mitochondrial gene mappings

Analysis Scripts:

step1_get_barcode_intra_inter.py - Computes intra- and inter-species divergence from barcode data
step2_get_barcode_divergent_positions.py - Identifies divergent positions in barcode regions
step3_get_all_divergent_positions.py - Analyzes all divergent positions across mitochondrial genome
step4_map2fasta.py - Converts mapping files to FASTA format
dnamap_to_alignment.py - Converts DNA mappings to sequence alignments
get_div_positions.py - Extracts divergent positions from alignments

Configuration Files:

group_partition - Species grouping information for analysis

`stage3_species_calibration/`

Performs species delineation using fixation index (Fst) and gene flow indices, focusing on Z-chromosome analysis.

Species Data Files (per species):

*.zprots - Z-chromosome protein sequences for each species/population
Symbolic links to gene and protein maps from stage1

Analysis Scripts:

step1_merge_haplotypes.py - Merges haplotype data for analysis
step2_prepare_prot_fixation_index.py & step2a_prepare_prot_fixation_index.py - Prepares protein data for Fst calculation
step3_get_overall_fst.py & step3a_get_overall_fst.py - Computes overall fixation index
step4_get_zchr_fst.py & step4a_get_zchr_fst.py - Computes Z-chromosome specific Fst
step5_get_good_positions.py - Identifies high-quality positions for analysis
step6_random_sample.py - Performs random sampling of positions
step7_compute_dmin_daver.py - Computes minimum and average distances
step8_summarize_gmin.py - Summarizes minimum genetic distance data
step9_test_cutoff.py - Tests different cutoff values for species delineation
step10_summarize.py - Creates final summary of results
get_zprot_ratio.py - Calculates Z-chromosome protein ratios

Configuration Files:

case_category - Categories for different analysis cases
group_sample.info - Sample grouping information
list & list.info - Sample lists and metadata
zprot_ratios - Z-chromosome protein ratio results

Results Files:

step*_results - Results from various analysis steps
step*_cmds - Command files for running each analysis step

`stage4_divergence_hotspots/`

Identifies and analyzes genomic regions with elevated divergence between species, with focus on functional annotation.

Species Data Files (per species/population):

*.aprots - Autosomal protein sequences
*.zprots - Z-chromosome protein sequences
*.hotspot - Identified divergence hotspots
*.hotspot.enrich - Functional enrichment analysis of hotspots
*_flybase_id - FlyBase database ID mappings
*_gos - Gene Ontology (GO) annotations

Analysis Scripts:

step3_partition_gene_maps.py & step4_partition_prot_maps.py - Partition gene/protein maps
step5_prepare_gene_fixation_index.py & step6_prepare_prot_fixation_index.py - Prepare Fst data
step7_get_gene_divergent_positions.py & step8_get_prot_divergent_positions.py - Identify divergent positions
step9_get_group_fixation_index.py - Compute group-level Fst
step10_get_group_divergent_positions.py - Get group divergent positions
step11_select_hotspot_by_fst_and_divcount.py - Select hotspots based on Fst and divergence count
step12_get_gos.py & step13_count_gos.py - Gene ontology analysis
step14_get_overall_fst.py - Overall Fst calculation
step15_get_common_gos.py - Common GO term analysis
step16_get_flybase_id.py - Map to FlyBase identifiers
step17_get_common_hotspot.py - Identify common hotspots
step18_get_common_background.py - Background analysis
step19_get_zchr_prots.py - Z-chromosome protein analysis
step20_get_zchr_fst.py - Z-chromosome Fst analysis
step21_analyze_zchr_hotspots.py - Analyze Z-chromosome hotspots
step22_analyze_inter_hotspots.py - Inter-species hotspot analysis

Functional Analysis:

count_gos.py - Count GO terms
get_interaction_significance.py - Test interaction significance
make_overlap_significance_table.py - Create overlap significance tables
prepare_interaction_table.py - Prepare interaction analysis tables
prepare_overlap_tests.py - Prepare overlap test data
test_overlap_significance.py - Test overlap significance
get_dnds.py - Calculate dN/dS ratios
get_zprot_ratio.py - Z-chromosome protein ratios

Reference Data:

go.annotation & go.parents - Gene Ontology annotation and hierarchy
Dm_go_map & Dm_def_map - Drosophila melanogaster GO and definition mappings
interaction_table - Gene interaction data

Common Analysis Results:

common_* files - Results shared across species comparisons
diff_species_* files - Species-specific difference analysis
hotspot_overlap_significance - Statistical significance of hotspot overlaps

`stage5_running_BPP/`

Runs Bayesian Phylogenetics and Phylogeography (BPP) analysis for species delineation.

Species Data Files (per species/population):

*_all_sample.seqs & *_all_sample_filt.seqs - All genomic sequences (raw and filtered)
*_Zchr_sample.seqs & *_Zchr_sample_filt.seqs - Z-chromosome sequences (raw and filtered)
*_good_segs - High-quality genomic segments for analysis
*_GDI_all & *_GDI_zchr - Gene Diversity Index for all chromosomes and Z-chromosome
*.out_nucl & *.out_zchr - BPP output results for nuclear and Z-chromosome data
Symbolic links to .zprots files from stage3

Analysis Scripts:

step1_select_good_segs.py - Select high-quality genomic segments
step2_prepare_input_seqs.py - Prepare sequence inputs for BPP
step3_remove_bad_sequences.py - Filter out low-quality sequences
step4_prepare_BPP_inputs_all.py - Prepare BPP inputs for all chromosomes
step4_prepare_BPP_inputs_zchr.py - Prepare BPP inputs for Z-chromosome
step5_check_convergence.py - Check MCMC convergence
step6_summarize_gdi.py - Summarize Gene Diversity Index results
step7_process_BPP_results.py - Process final BPP results

Configuration Files:

BPP_input_info - BPP input file information
group_sample.info - Sample grouping for BPP analysis
step*_cmds - Command files for each analysis step
step*_list - Sample lists for different steps
barcode_divcount - Barcode divergence counts

`stage6_study_hybrid_zone/`

Analyzes sympatric specimens and potential hybrid zones between closely related species.

Species-Specific Data:
Each species has multiple file types:

*_geno.fa - Genomic sequence data
*_exon.fasta - Exon sequences
*_cdna.fa - cDNA sequences
*_prot.fa - Protein sequences
*.gff - Gene annotation files
*.map - Mapping files
*_good_exons - High-quality exon data
*_exon_nr.fasta - Non-redundant exon sequences
*_exon.list - Exon lists
*_exon.size - Exon size information
*.gmin - Minimum genetic distance data
*.intra_inter - Intra- and inter-specific distance data
*.zprots - Z-chromosome protein data

Species Included:

Buph - Species being studied for hybridization
Hela - Related species
Lica - Libytheana carinenta
Lca - Target species for hybrid zone analysis
Pac - Comparison species for hybrid zone analysis

Analysis Scripts:

step1_get_consensus_seqs.py - Generate consensus sequences
step2_split_to_exon.py - Split sequences into exons
step3_map_to_exons.py - Map sequences to exon references
step4_remove_redundant_exons.py - Remove redundant exon sequences
step6_process_xml.py - Process XML format results
step7_get_first_hits.py - Extract first BLAST hits
step8_get_coverage.py - Calculate sequence coverage
step9_get_good_exons.py - Identify high-quality exons
step10_blast_to_MSA.py - Convert BLAST results to multiple sequence alignment
step11_filter_reads.py - Filter sequencing reads
step12_get_consensus.py - Generate consensus sequences
step13_merge_map.py - Merge mapping data
step14_merge_haplotypes.py - Merge haplotype information
step15_prepare_prot_fixation_index.py - Prepare protein Fst data
step16_get_overall_fst.py - Calculate overall Fst
step17_get_zchr_fst.py - Calculate Z-chromosome Fst
step18_get_good_positions.py - Identify good positions for analysis
step19_random_sample.py - Random sampling of positions
step20_compute_dmin_daver.py - Compute distance metrics
step21_summarize_gmin.py - Summarize minimum genetic distances
step22_summarize.py - Final summary of results

Reference Data:

*_exons.dmnd - Diamond database files for exon matching
job_header - Job submission header
sample_info - Sample metadata

Results Files:

step*_result - Results from various analysis steps
step*_cmds - Command files for each step
step*_list - Lists used in different analysis steps
Large .map files containing processed mapping data for DNA and protein sequences

Analysis Workflow

Stage 1: Process nuclear DNA sequences and create gene/protein mappings
Stage 2: Analyze mitochondrial DNA and barcode regions
Stage 3: Perform species calibration using fixation indices
Stage 4: Identify divergence hotspots and perform functional enrichment
Stage 5: Run Bayesian phylogenetic analysis for species delineation
Stage 6: Study hybrid zones and sympatric specimens

File Naming Conventions

*.map - Sequence mapping files (nucleotide or amino acid alignments)
*.fa / *.fasta - FASTA sequence files
*.zprots - Z-chromosome protein sequences
*.aprots - Autosomal protein sequences
*.hotspot - Divergence hotspot regions
*.gos - Gene Ontology annotations
step*_*.py - Analysis scripts for each processing step
*_cmds - Command files for running analyses
*_results - Output results from analyses

Species Codes Reference

The pipeline analyzes data from multiple butterfly species, identified by 3-letter codes:

Each species code represents a different butterfly species or population
Nuclear and mitochondrial data are processed separately
Z-chromosome data receives special attention due to its role in sex determination and speciation

This comprehensive analysis pipeline enables detailed study of butterfly evolution, speciation processes, and genomic divergence patterns.

Sample Collection

We assembled 25 pairs of butterfly counterparts from the eastern and western sides of the Central Texas suture zone. For each pair, we gathered 2 - 42 specimens on each side. Our analysis focused on protein-coding sequences. For 21 of the 25 pairs, we obtained their sequences using RNA-seq. For the remaining pairs in the Calephelis, Calycopis, Phoebis, and Pterourus genera, we did not have specimens preserved for RNA-seq. Instead, we relied on previously published genomic DNA data. These genomic DNA data were obtained via Illumina sequencing of paired-end libraries, targeting 10-fold coverage for each library. We obtained protein-coding sequences for these four genera by mapping the genomic reads to their annotated reference genomes and extracting the coding regions.

Library Preparation and Sequencing

Specimens for RNA-seq libraries were euthanized upon capture by thorax pinching, and their bodies (excluding wings and genitalia) were preserved in RNAlater solution. Total RNA was extracted from each specimen using a QIAGEN RNeasy Plus Mini Kit, and mRNA was subsequently isolated using a NEBNext Poly(A) mRNA Magnetic Isolation Module. RNA-seq libraries were then prepared with a NEBNext Ultra Directional RNA Library Prep Kit. We pooled 12 RNA-seq libraries and sequenced each pool using paired-end 150 bp sequencing on an Illumina Hiseq2500. Specimens for genomic DNA libraries were either collected in the field (and stored in EtOH) or borrowed from collections listed in the Acknowledgements. We extracted DNA from a piece of thoracic tissue for fresh specimens, and from either the abdomen or a leg for pinned museum specimens, using a Macherey-Nagel NucleoSpin tissue kit. Genomic DNA libraries were prepared with a NEBNext Ultra II DNA Library Prep Kit. We pooled these libraries based on their expected genome size, targeting a 10x sequencing depth per specimen. We sequenced the pooled genomic libraries using an Illumina HiSeq X Ten sequencing service from Genewiz.

Assembling Reference Transcriptomes

Sequencing adapters and low-quality portions (Phred score < 20) were removed from both genomic DNA and RNA-seq reads using AdapterRemoval. For genomic DNA, protein-coding sequences were obtained by mapping reads to annotated reference genomes and extracting coding regions. RNA-seq reads were used to assemble reference transcripts to derive protein-coding sequences.

We applied Trinity (version r20140413p1) to de novo assemble the transcriptome separately for each specimen with RNA-seq data. Because the resulting single-specimen transcriptomes were incomplete, we merged them for each taxon pair to obtain a more complete set of transcripts. We applied BLASTX (e-value: 0.00001, version 2.2.31+) to map the combined set of transcripts to a reference protein set, selecting the most closely related one among the following five references: Cecropterus [formely Achalarus] lyciades, Calycopis cecrops, Danaus plexippus, Lerema accius, and Pterourus glaucus. Transcripts without confident (e-value <= 0.00001) hits among reference proteins were discarded. We filtered the BLASTX hits by requiring the alignment to cover at least 50 % of residues in the reference protein, and the remaining hits were ranked primarily by e-value and secondarily by bit score. From the ranked list, we identified the best hits that aligned to non-overlapping regions in the query transcript. Typically, there was a single best hit per transcript; in cases where multiple non-overlapping best hits were found, the query transcript was split to multiple segments corresponding to these hits.

Due to the pooling of transcripts from different specimens and the presence of alternatively spliced forms, the resulting transcript set for each taxon pair was highly redundant. To remove this redundancy, we aligned transcripts that mapped to the same reference protein against each other using BLASTN (version 2.2.31+). Our goal was to represent each gene by its longest isoform, and we removed other isoforms and redundant transcripts through the following procedure. Shorter transcripts were removed if at least 80 % of their residues aligned to a longer transcript under one of the following high-identity criteria: (1) over 95 % sequence identity or (2) over 90 % sequence identity and sharing at least one identical 40-mer. Furthermore, to obtain a more complete representation of each gene, we merged de novo assembled transcripts that represented partial genes. If two transcripts aligned to different portions of a reference protein with at least 20 residues of overlap, they were merged. In the overlapping region, the sequence most similar to the reference protein was selected. This process resulted in a single set of reference transcripts for each taxon pair.

The gene content on the Lepidoptera Z chromosome is highly conserved, allowing us to infer Z-linked genes in other species by comparing them to the Heliconius melpomene reference genome. We used a reciprocal best hit approach to identify Z-linked genes/transcripts in each of our 25 taxon pairs. First, each transcript from the non-redundant reference transcript set was used in a BLASTX search against all Heliconius proteins to find the best-matching hit. If this top hit was a known Z-linked protein, we then performed a reciprocal TBLASTN search, using the Heliconius protein as the query against all reference transcripts for the taxon pair. A transcript was assigned as Z-linked only if it was the best hit in both the forward and reciprocal searches.

Obtaining Sequence Alignments for Each Taxon Pair

We aligned the RNA-seq reads to the reference transcripts using BWA (version 0.6.2-r126) and performed SNP calling with GATK (version 3.3-0). For each specimen, we derived its sequence from the BWA alignments and GATK outputs. Positions with a sequencing depth of less than two were considered gaps. For the remaining positions, we incorporated the called variants; otherwise, the position was assumed to be the same as the reference. For the Calephelis, Calycopis, Phoebis, and Pterourus genera, we used the BWA-GATK pipeline to align genomic DNA reads to the reference genomes and derive a genomic sequence for each specimen. We then extracted the protein-coding sequences from these genomic sequences. Finally, we filtered the protein-coding sequences to keep only those that were covered for at least 60 bp by a minimum of two specimens from the western side and two from the eastern side of the suture zone. This process resulted in a final set of 10,675 to 16,494 non-redundant reference protein-coding sequences for each taxon pair.

The assembled protein-coding sequences included mitochondrial genes, such as the cytochrome c oxidase subunit I (COI) gene, which contains the standard barcode sequence used for species delineation. We extracted the COI barcode sequences for our specimens by mapping known barcodes from the Barcode of Life Data System (BOLD) to the reference protein-coding sequences of each taxon pair.

Designing Measurements to Separate Species from Non-species Pairs

Using the multiple sequence alignments (MSAs) for protein-coding sequences, we computed the fixation index (F_ST) and the index of gene flow (I_GF) for each of the 25 taxon pairs. Both measures were calculated for all transcripts and for Z-linked transcripts only. We computed F_ST on both the DNA sequences and the protein sequences they encode. To avoid bias from varying sample sizes, we randomly sampled two specimens from each taxon at each position to compute F_ST. We computed F_ST using only positions where all four sampled specimens were present (non-gap) in the MSAs. F_ST was calculated as 1- π_within/π_between. π_within was the fraction of positions where the two specimens from the same taxon had different nucleotides (or amino acids for protein-level F_ST). π_between was the fraction of positions where two specimens from different taxa had different nucleotides or amino acids. F_ST was calculated for either taxa, and the final F_ST for the pair was the average of these two values.

A high F_ST value can sometimes be misleading, because it may arise from low intra-taxon divergence caused by bottleneck or sampling of close relatives, rather than from true speciation. To overcome this limitation, we estimated gene flow using G_min, a measure that detects introgression by identifying high haplotype similarity between specimens from different taxa. To mitigate potential bias from sample size on G_min values, for each gene/transcript, we randomly sampled two specimens from each taxon (four specimens for each pair) to calculate G_min.

We computed G_min in non-overlapping sliding windows of 1 kb, and coding sequences shorter than 1 kb were excluded. Within each window, we computed the minimal Hamming distance (d_min) between any two specimens from different taxa and the average Hamming distance (d_avr) between any pair of specimens. We filtered out windows where d_avr was less than 2, and the remaining number of windows was designated as N_total. For each window, we computed G_min as d_min/d_avr. We considered G_min lower than 0.25 to indicate introgression in a window, and the total number of such windows was designated as N_intro. Since introgression in any of the four analyzed specimens could lead to a low G_min, we used I_GF = N_intro / (4 * N_total) to reflect the average fraction of introgressed windows. We found that while changing the G_min cutoff from 0.1 (more stringent) to 0.6 (less stringent) altered the absolute values of I_GF, it did not affect the observed distinction between species and conspecific pairs.

To test candidate species pairs predicted by these genomic measures, we collected sympatric specimens for two pairs: Libytheana carinenta bachmanii and L. carinenta larvata, and Burnsius communis and B. albenzens. We obtained genomic sequences from these specimens, mapped the reads to the reference genomes we had for L. carinenta and B. communis, and derived the genomic sequence for each specimen using the BWA-GATK pipeline. To validate our findings, we repeated the calculations for the F_ST and the I_GF on these sympatric population pairs as described above.

Testing Other Approaches for Species Delineation

We used the Bayesian Phylogenetics and Phylogeography (BPP) software (version 4.7.0) to test the species delimitation of our candidate pairs based on a multi-species coalescent model. For each taxon pair, we randomly sampled 100 loci from our MSAs of Z-linked genes, with each locus ranging from 250 to 500 bp. We ensured that each selected locus contained at least 200 non-gap positions with data for a minimum of two specimens per taxon. The MSAs for 100 Z-linked loci, respectively, were used as inputs for BPP.

We specified priors for θ and τ using the same algorithms as in Minimalist BPP, the web server recommended in the BPP tutorial for obtaining sensible default priors. We initiated 5 independent Markov chains with different random seeds. Each chain was run with the parameters sampfreq = 5 and nsample = 500,000. To ensure convergence, the first 20 % of samples were discarded as burn-in, and convergence across chains was evaluated using the Gelman–Rubin diagnostic (cutoff < 1.01) with effective sample sizes (ESS) > 200. Chains that met these criteria were used to calculate the Genealogical Divergence Index (GDI) based on the posterior estimates of τ and θ from the Markov Chain Monte Carlo samples generated by BPP.

Identifying Divergence Hotspots between Species

We hypothesized that proteins with significant divergence between species and low polymorphism within each species are more likely to be involved in speciation. We named these "divergence hotspots" and identified them in each species pair based on two criteria. First, we calculated F_STfor each protein, and selected the top 20 % of proteins with the highest F_ST values. Second, from this set, we identified proteins that were significantly enriched in residues that diverged between the two species. We only considered positions present in at least two specimens of each species, denoting the total number of such positions in protein i as Ntot_i. Among these, we identified divergent positions where the dominant amino acids (in more than 75 % of specimens) of the two species differed. The total number of such divergent positions in protein i was denoted as Ndiv_i. The enrichment was quantified using a binomial test from the Python scipy module for each protein, with the following parameters: p = (∑ Ndiv_i) / (∑ Ntot_i), i.e., average rate of divergent positions across all proteins; x = Ndiv_i, n = Ntot_i.

To identify recurring divergence hotspots, we mapped the reference transcripts/protein-coding sequences for each species pair to Drosophila proteins in Flybase by identifying the top BLASTX hits. We only considered Flybase entries that were mapped to by reference sequences from a majority (at least 8 out of 15) of these species pairs. We then counted how many times each Flybase entry was among the divergence hotspots across all 15 species pairs. If a gene appeared significantly more often than would be expected by chance, it was considered a recurring divergence hotspot. To establish a statistical threshold for this significance, we performed a random simulation with 10,000 repetitions. In each simulation, we randomly selected a number of genes for each species pair equal to the actual number of divergence hotspots we identified in that pair. For each gene, we then selected the 5 % of simulations showing the highest frequency for that gene. The lowest frequency of observing a gene among these selected simulations was used as the cutoff for an observed frequency to be significantly higher than random (P < 0.05).

We identified enriched Gene Ontology (GO) terms associated with the recurring divergence hotspots using a binomial test (x = the number of recurring divergence hotspots associated with this GO term, n = the total number of recurring divergence hotspots, p = the probability for this GO term to be associated with any Flybase entry being analyzed). Similar analyses were performed on the positively selected recurring divergence hotspots and all the Z-linked genes. The list of enriched GO terms and their p-values was analyzed using REVIGO, a tool that summarizes and visualizes long, redundant lists of terms. REVIGO produced a network representing these terms, linking those that tend to be associated with the same proteins. We then manually arranged this network in Cytoscape to place semantically similar GO terms in close proximity.

Detecting Positively Selected Genes between Species

We mapped all reference transcripts/protein-coding sequences for each species pair to their respective Drosophila genes in Flybase as described above. We then concatenated the MSAs for multiple references from a species pair that mapped to the same Drosophila gene. We tested whether a gene was positively selected during the divergence of species pairs in our study using the McDonald–Kreitman (MK) test. This test compares the ratio of nonsynonymous (D_N) to synonymous (D_S) substitutions between species with the ratio of nonsynonymous (P_N) to synonymous (P_S) polymorphisms within species.

For each gene, we first estimated the number of within-species polymorphisms. For every pair of specimens within a species, we counted the number of P_N and P_S substitutions to change from the sequence of one specimen to another. We chose the most parsimonious path with the fewest nonsynonymous changes to account for multiple substitution paths. These values were summed across all specimens within the 15 pairs to obtain the total number of P_N and P_S polymorphisms for each gene. Next, we estimated the number of inter-species divergences. For each of the 15 species pairs, we counted the number of D_N and D_S substitutions to change from the consensus codons of one species to those of the other, again favoring the most parsimonious path. We summed these values over all 15 species pairs to obtain the total number of D_N and D_S divergences for each gene. Finally, to calculate statistical significance for positive selection in each gene, we used Fisher’s exact test to compare the ratio of D_N / D_S with the ratio of P_N /P_S. A gene was considered to show a significant sign of positive selection if its P-value was less than 0.05.

Supplemental data from: Genomic signatures of speciation in butterflies

Data files

Abstract

README: Butterfly RNA Analysis Pipeline

Overview

Directory Structure

previous/

reference_genomes/

stage1_integrate_nuclear_data/

stage2_integrate_mito_data/

stage3_species_calibration/

stage4_divergence_hotspots/

stage5_running_BPP/

stage6_study_hybrid_zone/

Analysis Workflow

File Naming Conventions

Species Codes Reference

Methods

`previous/`

`reference_genomes/`

`stage1_integrate_nuclear_data/`

`stage2_integrate_mito_data/`

`stage3_species_calibration/`

`stage4_divergence_hotspots/`

`stage5_running_BPP/`

`stage6_study_hybrid_zone/`