Supplemental data from: Genomic signatures of speciation in butterflies
Data files
Apr 01, 2026 version files 5.21 GB
-
README.md
13.55 KB
-
reference_genomes.tar.gz
920.82 MB
-
stage1_integrate_nuclear_data.tar.gz
3.35 GB
-
stage2_integrate_mito_data.tar.gz
1.09 MB
-
stage3_species_calibration.tar.gz
102.28 KB
-
stage4_divergence_hotspots.tar.gz
23.75 MB
-
stage5_running_BPP.tar.gz
3.15 MB
-
stage6_study_hybrid_zone.tar.gz
901.68 MB
Abstract
Studies of life rely on classifying organisms into species. Contrary to a frequent belief, simple and quantitative standards for species delineation are lacking, and debates about species delimitation create obstacles for conservation biology, agriculture, legislation, and education. To tackle this key biological question, we have chosen butterflies as model organisms. We sequenced and analyzed transcriptomes of 186 butterfly specimens representing pairs of close but clearly distinct species, conspecific populations, and taxa that are debated among experts. We find that species are robustly separated from conspecific populations by the combination of two measures computed on Z-linked genes: the fixation index that detects hiatus between species, and the extent of gene flow that quantifies reproductive isolation. These criteria suggest that all 9 butterfly pairs that caused experts' disagreement are distinct species, not populations or subspecies. When applied to Homo, our criteria agree that all modern humans are the same species, distinct from Neanderthals, suggesting the relevance of this study beyond butterflies. Furthermore, we found that divergence and positive selection in proteins involved in interaction with DNA (including proteins encoded by trans-regulatory elements), circadian clock, pheromone sensing, development, and immune response recurrently correlate with speciation. A significant fraction of these divergent proteins is encoded by the Z chromosome, which appears to be more resistant to introgression than autosomes. Taken together, we find possible common speciation mechanisms in butterflies, present additional support for an important role of the Z chromosome in speciation of butterflies, and suggest quantitative criteria for butterfly species delimitation using genomic data, which is vital for the exploration of biodiversity.
This repository contains a comprehensive analysis pipeline for studying butterfly speciation and divergence using RNA-seq data. The analysis is organized into 6 main stages, each focusing on different aspects of genomic and transcriptomic analysis.
Overview
This pipeline analyzes nuclear and mitochondrial DNA sequences from multiple butterfly species to study:
- Species delineation using fixation indices
- Divergence hotspots identification
- Bayesian phylogenetics and phylogeography (BPP)
- Hybrid zone analysis
Directory Structure
previous/
Contains older versions of the analysis results as compressed tar.gz files. These serve as backups of previous iterations of each analysis stage.
Files:
reference_genomes.tar.gz- Previous version of reference genome datastage1_integrate_nuclear_data.tar.gz- Previous nuclear data integration resultsstage2_integrate_mito_data.tar.gz- Previous mitochondrial data integration resultsstage3_species_calibration.tar.gz- Previous species calibration resultsstage4_divergence_hotspots.tar.gz- Previous divergence hotspot analysisstage5_running_BPP.tar.gz- Previous BPP analysis resultsstage6_study_hybrid_zone.tar.gz- Previous hybrid zone study results
reference_genomes/
Contains reference genome sequences and protein sets for multiple butterfly species.
Files:
*_genome.fa- Reference genome FASTA files for various species:Burnsius_communis_genome.fa- Burnsius communis genomeCalephelis_nemesis_genome.fa- Calephelis nemesis genomeCalycopis_cecrops_genome.fa- Calycopis cecrops genomeLibytheana_carinenta_genome.fa- Libytheana carinenta genomePhoebis_sennae_genome.fa- Phoebis sennae genomePterourus_glaucus_genome.fa- Pterourus glaucus genome
*_protein.fa- Reference protein sequence files:Calycopis_cecrops_protein.fa- Calycopis cecrops proteinsCecropterus_lyciades_protein.fa- Cecropterus lyciades proteinsDanaus_plexippus_protein.fa- Danaus plexippus proteinsPterourus_glaucus_protein.fa- Pterourus glaucus proteins
Lerema_accius.fa- Lerema accius sequence data
stage1_integrate_nuclear_data/
Processes nuclear DNA sequences and creates codon/protein mappings for multiple butterfly species.
Species Data Files (pattern: {SPECIES_CODE}_gene.map and {SPECIES_CODE}_prot.map):
Each species has two main files:
*_gene.map- Gene mapping data (nucleotide-level alignments)*_prot.map- Protein mapping data (amino acid-level alignments)
Processing Scripts:
step1_get_codon_map.py- Converts nucleotide alignments to codon mappingsstep2_get_prot_map.py- Processes protein-level alignments
stage2_integrate_mito_data/
Analyzes mitochondrial DNA sequences and barcode regions for species identification and divergence analysis.
Species Data Files (per species):
*_barcode.map- Mitochondrial barcode region mappings*_mito.fa- Complete mitochondrial genome sequences*_mito.map- Mitochondrial gene mappings
Analysis Scripts:
step1_get_barcode_intra_inter.py- Computes intra- and inter-species divergence from barcode datastep2_get_barcode_divergent_positions.py- Identifies divergent positions in barcode regionsstep3_get_all_divergent_positions.py- Analyzes all divergent positions across mitochondrial genomestep4_map2fasta.py- Converts mapping files to FASTA formatdnamap_to_alignment.py- Converts DNA mappings to sequence alignmentsget_div_positions.py- Extracts divergent positions from alignments
Configuration Files:
group_partition- Species grouping information for analysis
stage3_species_calibration/
Performs species delineation using fixation index (Fst) and gene flow indices, focusing on Z-chromosome analysis.
Species Data Files (per species):
*.zprots- Z-chromosome protein sequences for each species/population- Symbolic links to gene and protein maps from stage1
Analysis Scripts:
step1_merge_haplotypes.py- Merges haplotype data for analysisstep2_prepare_prot_fixation_index.py&step2a_prepare_prot_fixation_index.py- Prepares protein data for Fst calculationstep3_get_overall_fst.py&step3a_get_overall_fst.py- Computes overall fixation indexstep4_get_zchr_fst.py&step4a_get_zchr_fst.py- Computes Z-chromosome specific Fststep5_get_good_positions.py- Identifies high-quality positions for analysisstep6_random_sample.py- Performs random sampling of positionsstep7_compute_dmin_daver.py- Computes minimum and average distancesstep8_summarize_gmin.py- Summarizes minimum genetic distance datastep9_test_cutoff.py- Tests different cutoff values for species delineationstep10_summarize.py- Creates final summary of resultsget_zprot_ratio.py- Calculates Z-chromosome protein ratios
Configuration Files:
case_category- Categories for different analysis casesgroup_sample.info- Sample grouping informationlist&list.info- Sample lists and metadatazprot_ratios- Z-chromosome protein ratio results
Results Files:
step*_results- Results from various analysis stepsstep*_cmds- Command files for running each analysis step
stage4_divergence_hotspots/
Identifies and analyzes genomic regions with elevated divergence between species, with focus on functional annotation.
Species Data Files (per species/population):
*.aprots- Autosomal protein sequences*.zprots- Z-chromosome protein sequences*.hotspot- Identified divergence hotspots*.hotspot.enrich- Functional enrichment analysis of hotspots*_flybase_id- FlyBase database ID mappings*_gos- Gene Ontology (GO) annotations
Analysis Scripts:
step3_partition_gene_maps.py&step4_partition_prot_maps.py- Partition gene/protein mapsstep5_prepare_gene_fixation_index.py&step6_prepare_prot_fixation_index.py- Prepare Fst datastep7_get_gene_divergent_positions.py&step8_get_prot_divergent_positions.py- Identify divergent positionsstep9_get_group_fixation_index.py- Compute group-level Fststep10_get_group_divergent_positions.py- Get group divergent positionsstep11_select_hotspot_by_fst_and_divcount.py- Select hotspots based on Fst and divergence countstep12_get_gos.py&step13_count_gos.py- Gene ontology analysisstep14_get_overall_fst.py- Overall Fst calculationstep15_get_common_gos.py- Common GO term analysisstep16_get_flybase_id.py- Map to FlyBase identifiersstep17_get_common_hotspot.py- Identify common hotspotsstep18_get_common_background.py- Background analysisstep19_get_zchr_prots.py- Z-chromosome protein analysisstep20_get_zchr_fst.py- Z-chromosome Fst analysisstep21_analyze_zchr_hotspots.py- Analyze Z-chromosome hotspotsstep22_analyze_inter_hotspots.py- Inter-species hotspot analysis
Functional Analysis:
count_gos.py- Count GO termsget_interaction_significance.py- Test interaction significancemake_overlap_significance_table.py- Create overlap significance tablesprepare_interaction_table.py- Prepare interaction analysis tablesprepare_overlap_tests.py- Prepare overlap test datatest_overlap_significance.py- Test overlap significanceget_dnds.py- Calculate dN/dS ratiosget_zprot_ratio.py- Z-chromosome protein ratios
Reference Data:
go.annotation&go.parents- Gene Ontology annotation and hierarchyDm_go_map&Dm_def_map- Drosophila melanogaster GO and definition mappingsinteraction_table- Gene interaction data
Common Analysis Results:
common_*files - Results shared across species comparisonsdiff_species_*files - Species-specific difference analysishotspot_overlap_significance- Statistical significance of hotspot overlaps
stage5_running_BPP/
Runs Bayesian Phylogenetics and Phylogeography (BPP) analysis for species delineation.
Species Data Files (per species/population):
*_all_sample.seqs&*_all_sample_filt.seqs- All genomic sequences (raw and filtered)*_Zchr_sample.seqs&*_Zchr_sample_filt.seqs- Z-chromosome sequences (raw and filtered)*_good_segs- High-quality genomic segments for analysis*_GDI_all&*_GDI_zchr- Gene Diversity Index for all chromosomes and Z-chromosome*.out_nucl&*.out_zchr- BPP output results for nuclear and Z-chromosome data- Symbolic links to
.zprotsfiles from stage3
Analysis Scripts:
step1_select_good_segs.py- Select high-quality genomic segmentsstep2_prepare_input_seqs.py- Prepare sequence inputs for BPPstep3_remove_bad_sequences.py- Filter out low-quality sequencesstep4_prepare_BPP_inputs_all.py- Prepare BPP inputs for all chromosomesstep4_prepare_BPP_inputs_zchr.py- Prepare BPP inputs for Z-chromosomestep5_check_convergence.py- Check MCMC convergencestep6_summarize_gdi.py- Summarize Gene Diversity Index resultsstep7_process_BPP_results.py- Process final BPP results
Configuration Files:
BPP_input_info- BPP input file informationgroup_sample.info- Sample grouping for BPP analysisstep*_cmds- Command files for each analysis stepstep*_list- Sample lists for different stepsbarcode_divcount- Barcode divergence counts
stage6_study_hybrid_zone/
Analyzes sympatric specimens and potential hybrid zones between closely related species.
Species-Specific Data:
Each species has multiple file types:
*_geno.fa- Genomic sequence data*_exon.fasta- Exon sequences*_cdna.fa- cDNA sequences*_prot.fa- Protein sequences*.gff- Gene annotation files*.map- Mapping files*_good_exons- High-quality exon data*_exon_nr.fasta- Non-redundant exon sequences*_exon.list- Exon lists*_exon.size- Exon size information*.gmin- Minimum genetic distance data*.intra_inter- Intra- and inter-specific distance data*.zprots- Z-chromosome protein data
Species Included:
Buph- Species being studied for hybridizationHela- Related speciesLica- Libytheana carinentaLca- Target species for hybrid zone analysisPac- Comparison species for hybrid zone analysis
Analysis Scripts:
step1_get_consensus_seqs.py- Generate consensus sequencesstep2_split_to_exon.py- Split sequences into exonsstep3_map_to_exons.py- Map sequences to exon referencesstep4_remove_redundant_exons.py- Remove redundant exon sequencesstep6_process_xml.py- Process XML format resultsstep7_get_first_hits.py- Extract first BLAST hitsstep8_get_coverage.py- Calculate sequence coveragestep9_get_good_exons.py- Identify high-quality exonsstep10_blast_to_MSA.py- Convert BLAST results to multiple sequence alignmentstep11_filter_reads.py- Filter sequencing readsstep12_get_consensus.py- Generate consensus sequencesstep13_merge_map.py- Merge mapping datastep14_merge_haplotypes.py- Merge haplotype informationstep15_prepare_prot_fixation_index.py- Prepare protein Fst datastep16_get_overall_fst.py- Calculate overall Fststep17_get_zchr_fst.py- Calculate Z-chromosome Fststep18_get_good_positions.py- Identify good positions for analysisstep19_random_sample.py- Random sampling of positionsstep20_compute_dmin_daver.py- Compute distance metricsstep21_summarize_gmin.py- Summarize minimum genetic distancesstep22_summarize.py- Final summary of results
Reference Data:
*_exons.dmnd- Diamond database files for exon matchingjob_header- Job submission headersample_info- Sample metadata
Results Files:
step*_result- Results from various analysis stepsstep*_cmds- Command files for each stepstep*_list- Lists used in different analysis steps- Large
.mapfiles containing processed mapping data for DNA and protein sequences
Analysis Workflow
- Stage 1: Process nuclear DNA sequences and create gene/protein mappings
- Stage 2: Analyze mitochondrial DNA and barcode regions
- Stage 3: Perform species calibration using fixation indices
- Stage 4: Identify divergence hotspots and perform functional enrichment
- Stage 5: Run Bayesian phylogenetic analysis for species delineation
- Stage 6: Study hybrid zones and sympatric specimens
File Naming Conventions
*.map- Sequence mapping files (nucleotide or amino acid alignments)*.fa/*.fasta- FASTA sequence files*.zprots- Z-chromosome protein sequences*.aprots- Autosomal protein sequences*.hotspot- Divergence hotspot regions*.gos- Gene Ontology annotationsstep*_*.py- Analysis scripts for each processing step*_cmds- Command files for running analyses*_results- Output results from analyses
Species Codes Reference
The pipeline analyzes data from multiple butterfly species, identified by 3-letter codes:
- Each species code represents a different butterfly species or population
- Nuclear and mitochondrial data are processed separately
- Z-chromosome data receives special attention due to its role in sex determination and speciation
This comprehensive analysis pipeline enables detailed study of butterfly evolution, speciation processes, and genomic divergence patterns.
Sample Collection
We assembled 25 pairs of butterfly counterparts from the eastern and western sides of the Central Texas suture zone. For each pair, we gathered 2 - 42 specimens on each side. Our analysis focused on protein-coding sequences. For 21 of the 25 pairs, we obtained their sequences using RNA-seq. For the remaining pairs in the Calephelis, Calycopis, Phoebis, and Pterourus genera, we did not have specimens preserved for RNA-seq. Instead, we relied on previously published genomic DNA data. These genomic DNA data were obtained via Illumina sequencing of paired-end libraries, targeting 10-fold coverage for each library. We obtained protein-coding sequences for these four genera by mapping the genomic reads to their annotated reference genomes and extracting the coding regions.
Library Preparation and Sequencing
Specimens for RNA-seq libraries were euthanized upon capture by thorax pinching, and their bodies (excluding wings and genitalia) were preserved in RNAlater solution. Total RNA was extracted from each specimen using a QIAGEN RNeasy Plus Mini Kit, and mRNA was subsequently isolated using a NEBNext Poly(A) mRNA Magnetic Isolation Module. RNA-seq libraries were then prepared with a NEBNext Ultra Directional RNA Library Prep Kit. We pooled 12 RNA-seq libraries and sequenced each pool using paired-end 150 bp sequencing on an Illumina Hiseq2500. Specimens for genomic DNA libraries were either collected in the field (and stored in EtOH) or borrowed from collections listed in the Acknowledgements. We extracted DNA from a piece of thoracic tissue for fresh specimens, and from either the abdomen or a leg for pinned museum specimens, using a Macherey-Nagel NucleoSpin tissue kit. Genomic DNA libraries were prepared with a NEBNext Ultra II DNA Library Prep Kit. We pooled these libraries based on their expected genome size, targeting a 10x sequencing depth per specimen. We sequenced the pooled genomic libraries using an Illumina HiSeq X Ten sequencing service from Genewiz.
Assembling Reference Transcriptomes
Sequencing adapters and low-quality portions (Phred score < 20) were removed from both genomic DNA and RNA-seq reads using AdapterRemoval. For genomic DNA, protein-coding sequences were obtained by mapping reads to annotated reference genomes and extracting coding regions. RNA-seq reads were used to assemble reference transcripts to derive protein-coding sequences.
We applied Trinity (version r20140413p1) to de novo assemble the transcriptome separately for each specimen with RNA-seq data. Because the resulting single-specimen transcriptomes were incomplete, we merged them for each taxon pair to obtain a more complete set of transcripts. We applied BLASTX (e-value: 0.00001, version 2.2.31+) to map the combined set of transcripts to a reference protein set, selecting the most closely related one among the following five references: Cecropterus [formely Achalarus] lyciades, Calycopis cecrops, Danaus plexippus, Lerema accius, and Pterourus glaucus. Transcripts without confident (e-value <= 0.00001) hits among reference proteins were discarded. We filtered the BLASTX hits by requiring the alignment to cover at least 50 % of residues in the reference protein, and the remaining hits were ranked primarily by e-value and secondarily by bit score. From the ranked list, we identified the best hits that aligned to non-overlapping regions in the query transcript. Typically, there was a single best hit per transcript; in cases where multiple non-overlapping best hits were found, the query transcript was split to multiple segments corresponding to these hits.
Due to the pooling of transcripts from different specimens and the presence of alternatively spliced forms, the resulting transcript set for each taxon pair was highly redundant. To remove this redundancy, we aligned transcripts that mapped to the same reference protein against each other using BLASTN (version 2.2.31+). Our goal was to represent each gene by its longest isoform, and we removed other isoforms and redundant transcripts through the following procedure. Shorter transcripts were removed if at least 80 % of their residues aligned to a longer transcript under one of the following high-identity criteria: (1) over 95 % sequence identity or (2) over 90 % sequence identity and sharing at least one identical 40-mer. Furthermore, to obtain a more complete representation of each gene, we merged de novo assembled transcripts that represented partial genes. If two transcripts aligned to different portions of a reference protein with at least 20 residues of overlap, they were merged. In the overlapping region, the sequence most similar to the reference protein was selected. This process resulted in a single set of reference transcripts for each taxon pair.
The gene content on the Lepidoptera Z chromosome is highly conserved, allowing us to infer Z-linked genes in other species by comparing them to the Heliconius melpomene reference genome. We used a reciprocal best hit approach to identify Z-linked genes/transcripts in each of our 25 taxon pairs. First, each transcript from the non-redundant reference transcript set was used in a BLASTX search against all Heliconius proteins to find the best-matching hit. If this top hit was a known Z-linked protein, we then performed a reciprocal TBLASTN search, using the Heliconius protein as the query against all reference transcripts for the taxon pair. A transcript was assigned as Z-linked only if it was the best hit in both the forward and reciprocal searches.
Obtaining Sequence Alignments for Each Taxon Pair
We aligned the RNA-seq reads to the reference transcripts using BWA (version 0.6.2-r126) and performed SNP calling with GATK (version 3.3-0). For each specimen, we derived its sequence from the BWA alignments and GATK outputs. Positions with a sequencing depth of less than two were considered gaps. For the remaining positions, we incorporated the called variants; otherwise, the position was assumed to be the same as the reference. For the Calephelis, Calycopis, Phoebis, and Pterourus genera, we used the BWA-GATK pipeline to align genomic DNA reads to the reference genomes and derive a genomic sequence for each specimen. We then extracted the protein-coding sequences from these genomic sequences. Finally, we filtered the protein-coding sequences to keep only those that were covered for at least 60 bp by a minimum of two specimens from the western side and two from the eastern side of the suture zone. This process resulted in a final set of 10,675 to 16,494 non-redundant reference protein-coding sequences for each taxon pair.
The assembled protein-coding sequences included mitochondrial genes, such as the cytochrome c oxidase subunit I (COI) gene, which contains the standard barcode sequence used for species delineation. We extracted the COI barcode sequences for our specimens by mapping known barcodes from the Barcode of Life Data System (BOLD) to the reference protein-coding sequences of each taxon pair.
Designing Measurements to Separate Species from Non-species Pairs
Using the multiple sequence alignments (MSAs) for protein-coding sequences, we computed the fixation index (FST) and the index of gene flow (IGF) for each of the 25 taxon pairs. Both measures were calculated for all transcripts and for Z-linked transcripts only. We computed FST on both the DNA sequences and the protein sequences they encode. To avoid bias from varying sample sizes, we randomly sampled two specimens from each taxon at each position to compute FST. We computed FST using only positions where all four sampled specimens were present (non-gap) in the MSAs. FST was calculated as 1- πwithin/πbetween. πwithin was the fraction of positions where the two specimens from the same taxon had different nucleotides (or amino acids for protein-level FST). πbetween was the fraction of positions where two specimens from different taxa had different nucleotides or amino acids. FST was calculated for either taxa, and the final FST for the pair was the average of these two values.
A high FST value can sometimes be misleading, because it may arise from low intra-taxon divergence caused by bottleneck or sampling of close relatives, rather than from true speciation. To overcome this limitation, we estimated gene flow using Gmin, a measure that detects introgression by identifying high haplotype similarity between specimens from different taxa. To mitigate potential bias from sample size on Gmin values, for each gene/transcript, we randomly sampled two specimens from each taxon (four specimens for each pair) to calculate Gmin.
We computed Gmin in non-overlapping sliding windows of 1 kb, and coding sequences shorter than 1 kb were excluded. Within each window, we computed the minimal Hamming distance (dmin) between any two specimens from different taxa and the average Hamming distance (davr) between any pair of specimens. We filtered out windows where davr was less than 2, and the remaining number of windows was designated as Ntotal. For each window, we computed Gmin as dmin/davr. We considered Gmin lower than 0.25 to indicate introgression in a window, and the total number of such windows was designated as Nintro. Since introgression in any of the four analyzed specimens could lead to a low Gmin, we used IGF = Nintro / (4 * Ntotal) to reflect the average fraction of introgressed windows. We found that while changing the Gmin cutoff from 0.1 (more stringent) to 0.6 (less stringent) altered the absolute values of IGF, it did not affect the observed distinction between species and conspecific pairs.
To test candidate species pairs predicted by these genomic measures, we collected sympatric specimens for two pairs: Libytheana carinenta bachmanii and L. carinenta larvata, and Burnsius communis and B. albenzens. We obtained genomic sequences from these specimens, mapped the reads to the reference genomes we had for L. carinenta and B. communis, and derived the genomic sequence for each specimen using the BWA-GATK pipeline. To validate our findings, we repeated the calculations for the FST and the IGF on these sympatric population pairs as described above.
Testing Other Approaches for Species Delineation
We used the Bayesian Phylogenetics and Phylogeography (BPP) software (version 4.7.0) to test the species delimitation of our candidate pairs based on a multi-species coalescent model. For each taxon pair, we randomly sampled 100 loci from our MSAs of Z-linked genes, with each locus ranging from 250 to 500 bp. We ensured that each selected locus contained at least 200 non-gap positions with data for a minimum of two specimens per taxon. The MSAs for 100 Z-linked loci, respectively, were used as inputs for BPP.
We specified priors for θ and τ using the same algorithms as in Minimalist BPP, the web server recommended in the BPP tutorial for obtaining sensible default priors. We initiated 5 independent Markov chains with different random seeds. Each chain was run with the parameters sampfreq = 5 and nsample = 500,000. To ensure convergence, the first 20 % of samples were discarded as burn-in, and convergence across chains was evaluated using the Gelman–Rubin diagnostic (cutoff < 1.01) with effective sample sizes (ESS) > 200. Chains that met these criteria were used to calculate the Genealogical Divergence Index (GDI) based on the posterior estimates of τ and θ from the Markov Chain Monte Carlo samples generated by BPP.
Identifying Divergence Hotspots between Species
We hypothesized that proteins with significant divergence between species and low polymorphism within each species are more likely to be involved in speciation. We named these "divergence hotspots" and identified them in each species pair based on two criteria. First, we calculated FSTfor each protein, and selected the top 20 % of proteins with the highest FST values. Second, from this set, we identified proteins that were significantly enriched in residues that diverged between the two species. We only considered positions present in at least two specimens of each species, denoting the total number of such positions in protein i as Ntoti. Among these, we identified divergent positions where the dominant amino acids (in more than 75 % of specimens) of the two species differed. The total number of such divergent positions in protein i was denoted as Ndivi. The enrichment was quantified using a binomial test from the Python scipy module for each protein, with the following parameters: p = (∑ Ndivi) / (∑ Ntoti), i.e., average rate of divergent positions across all proteins; x = Ndivi, n = Ntoti.
To identify recurring divergence hotspots, we mapped the reference transcripts/protein-coding sequences for each species pair to Drosophila proteins in Flybase by identifying the top BLASTX hits. We only considered Flybase entries that were mapped to by reference sequences from a majority (at least 8 out of 15) of these species pairs. We then counted how many times each Flybase entry was among the divergence hotspots across all 15 species pairs. If a gene appeared significantly more often than would be expected by chance, it was considered a recurring divergence hotspot. To establish a statistical threshold for this significance, we performed a random simulation with 10,000 repetitions. In each simulation, we randomly selected a number of genes for each species pair equal to the actual number of divergence hotspots we identified in that pair. For each gene, we then selected the 5 % of simulations showing the highest frequency for that gene. The lowest frequency of observing a gene among these selected simulations was used as the cutoff for an observed frequency to be significantly higher than random (P < 0.05).
We identified enriched Gene Ontology (GO) terms associated with the recurring divergence hotspots using a binomial test (x = the number of recurring divergence hotspots associated with this GO term, n = the total number of recurring divergence hotspots, p = the probability for this GO term to be associated with any Flybase entry being analyzed). Similar analyses were performed on the positively selected recurring divergence hotspots and all the Z-linked genes. The list of enriched GO terms and their p-values was analyzed using REVIGO, a tool that summarizes and visualizes long, redundant lists of terms. REVIGO produced a network representing these terms, linking those that tend to be associated with the same proteins. We then manually arranged this network in Cytoscape to place semantically similar GO terms in close proximity.
Detecting Positively Selected Genes between Species
We mapped all reference transcripts/protein-coding sequences for each species pair to their respective Drosophila genes in Flybase as described above. We then concatenated the MSAs for multiple references from a species pair that mapped to the same Drosophila gene. We tested whether a gene was positively selected during the divergence of species pairs in our study using the McDonald–Kreitman (MK) test. This test compares the ratio of nonsynonymous (DN) to synonymous (DS) substitutions between species with the ratio of nonsynonymous (PN) to synonymous (PS) polymorphisms within species.
For each gene, we first estimated the number of within-species polymorphisms. For every pair of specimens within a species, we counted the number of PN and PS substitutions to change from the sequence of one specimen to another. We chose the most parsimonious path with the fewest nonsynonymous changes to account for multiple substitution paths. These values were summed across all specimens within the 15 pairs to obtain the total number of PN and PS polymorphisms for each gene. Next, we estimated the number of inter-species divergences. For each of the 15 species pairs, we counted the number of DN and DS substitutions to change from the consensus codons of one species to those of the other, again favoring the most parsimonious path. We summed these values over all 15 species pairs to obtain the total number of DN and DS divergences for each gene. Finally, to calculate statistical significance for positive selection in each gene, we used Fisher’s exact test to compare the ratio of DN / DS with the ratio of PN /PS. A gene was considered to show a significant sign of positive selection if its P-value was less than 0.05.
