Data from: Whole genomes reveal evolutionary relationships and mechanisms underlying gene-tree discordance in Neodiprion sawflies
Cite this dataset
Linnen, Catherine et al. (2023). Data from: Whole genomes reveal evolutionary relationships and mechanisms underlying gene-tree discordance in Neodiprion sawflies [Dataset]. Dryad. https://doi.org/10.5061/dryad.bg79cnpf7
Rapidly evolving taxa are excellent models for understanding the mechanisms that give rise to biodiversity. However, developing an accurate historical framework for comparative analysis of such lineages remains a challenge due to ubiquitous incomplete lineage sorting and introgression. Here, we use a whole-genome alignment, multiple locus-sampling strategies, and locus-based and SNP-based species-tree methods to infer a species tree for eastern North American Neodiprion species, a clade of pine-feeding sawflies (Order: Hymenoptera; Family: Diprionidae). We recovered a well-supported species tree that, except for three uncertain relationships, is robust to different strategies for analyzing whole-genome data. Despite this consistency, underlying gene-tree discordance is high. To understand this discordance, we use multiple regression to model topological discordance as a function of several genomic features. We find that gene-tree discordance tends to be higher in regions of the genome that may be more prone to gene-tree estimation error, as indicated by a lower density of parsimony-informative sites, a higher density of genes, a higher average pairwise genetic distance, and gene trees with lower average bootstrap support. Also, contrary to the expectation that discordance via incomplete lineage sorting is reduced in low-recombination regions of the genome, we find a negative correlation between recombination rate and topological discordance. We offer potential explanations for this pattern and hypothesize that it may be unique to lineages that have diverged with gene flow. Our analysis also reveals an unexpected discordance hotspot on Chromosome 1, which contains several genes potentially involved in mitochondrial-nuclear interactions and produces a gene tree that resembles a highly discordant mitochondrial tree.
Based on these observations, we hypothesize that our genome-wide scan for topological discordance has identified a nuclear locus involved in a mito-nuclear incompatibility. Together, these results demonstrate how phylogenomic analysis coupled with high-quality, annotated genomes can generate novel hypotheses about the mechanisms that drive divergence and produce variable genealogical histories across genomes.
DNA was extracted from field-caught larvae. Then, Illumina libraries were prepared and sequenced on an Illumina NextSeq 500 with PE150 reads, which produced 14-27 million reads per individual.
To obtain a multi-genome alignment, we used a pseudo-reference-based approach, with an annotated, reference-quality N. lecontei genome (iyNeoLeco1.1 RefSeq GCF_021901455.1) serving as the reference. Briefly, we first used bowtie2 v2.4.1 to map reads from each species to the N. lecontei reference genome. To allow for divergence between reads and the N. lecontei reference, we initially allowed one mismatch in the seed and used the “local” mapping option in bowtie2. New variants (excluding indels) were incorporated using samtools v1.10 and bcftools v1.10.2. In a second round of mapping, this process was repeated using the first iteration of the genome for each species as the new reference genome. The third round of mapping removed the seed mismatch. The fourth and fifth iterations required end-to-end mapping. After the fifth iteration, we used a custom script to replace with an “N” any nucleotide that had a read depth of less than 4 or an excessively high mapping depth (the highest 1% of depths for each species). All bioinformatics commands and scripts can be found on the LinnenLab GitHub page under the Herrig_etal_NeodiprionPhylogeny repository (https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny). This approach produced FASTA files for each species, with genome sequences in N. lecontei coordinates. All FASTA files are provided here.
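The depth-masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' custom script (which is in the GitHub repository): the function name mask_by_depth, its parameters, and the tie-handling rule for the top 1% (mask only depths strictly greater than the value at the 99th-percentile rank) are assumptions made for this example.

```python
def mask_by_depth(seq, depths, min_depth=4, top_percent=1):
    """Replace low-depth and excessively high-depth positions with 'N'.

    seq     -- consensus sequence for one species (string)
    depths  -- per-position read depth, same length as seq
    Hypothetical helper: the cutoff rule for the top 1% is an assumption.
    """
    if len(seq) != len(depths):
        raise ValueError("sequence and depth vector must have equal length")

    # Number of positions that make up the highest `top_percent` of depths.
    n_top = len(depths) * top_percent // 100
    if n_top > 0:
        # Largest depth that is NOT in the top group; anything above it is masked.
        high_cutoff = sorted(depths)[len(depths) - n_top - 1]
    else:
        # Too few positions to define a top group: mask nothing for high depth.
        high_cutoff = max(depths, default=0)

    return "".join(
        "N" if depth < min_depth or depth > high_cutoff else base
        for base, depth in zip(seq, depths)
    )
```

In practice the per-position depths would come from a tool such as samtools depth, and the cutoff would be computed per species across the whole genome rather than per sequence.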
We next used the FASTA files to produce additional datasets for analysis. First, we used bedtools v2.30.0 to divide the seven Neodiprion chromosomes into non-overlapping windows of different sizes: 5 kb, 10 kb, 50 kb, 100 kb, 500 kb, and 1 Mb. Second, to approximate a dataset of protein-coding genes analogous to an RNAseq or exon-capture phylogenomic dataset, we used gffread v0.11.7 with the -w flag to write FASTA files with spliced exons for each transcript for each species using the NCBI Neodiprion lecontei Annotation Release (iyNeoLeco1.1 RefSeq GCF_021901455.1). Windowed and gene datasets are not provided here because each consists of thousands of individual NEXUS files. Instead, these can be regenerated using scripts (available on GitHub: https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny) to cut up the genome into desired loci (windows or genes) and convert these to NEXUS format. Third, we called single nucleotide polymorphisms (SNPs) across the entire genome using SNP-sites v2.5.1. We then filtered the data to exclude SNPs that were missing in more than 10% of species and sites with more than two alleles. In addition to analyzing all SNPs (which likely include tightly linked sites), we produced additional datasets with one SNP sampled every 1 kb, 5 kb, 10 kb, 50 kb, or 100 kb using SNP-sites; the more sparsely sampled datasets are on par with a dataset that might be generated via RADseq. We transformed each of the six datasets into NEXUS format. All six SNP NEXUS files are provided and can be used as input for SVDquartets.
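The windowing, SNP filtering, and SNP thinning steps can be illustrated with the following Python sketch. It is not the pipeline used to build these files (that relied on bedtools, SNP-sites, and the scripts on GitHub). All function names are hypothetical, coordinates are assumed to be 0-based, and keeping the first SNP in each bin is an assumed thinning rule.

```python
def make_windows(chrom_length, window_size):
    """Non-overlapping, half-open (start, end) windows in BED-style coordinates."""
    return [
        (start, min(start + window_size, chrom_length))
        for start in range(0, chrom_length, window_size)
    ]


def keep_site(column, max_missing=0.10):
    """Decide whether one alignment column (one base per species) is retained.

    Keeps variable sites with exactly two alleles and with missing data
    in no more than `max_missing` of the species.
    """
    called = [base for base in column.upper() if base in "ACGT"]
    n_missing = len(column) - len(called)
    return n_missing <= max_missing * len(column) and len(set(called)) == 2


def thin_snps(positions, spacing):
    """Keep the first SNP in each non-overlapping bin of `spacing` bp (assumed rule)."""
    kept = []
    last_bin = None
    for pos in sorted(positions):
        current_bin = pos // spacing
        if current_bin != last_bin:
            kept.append(pos)
            last_bin = current_bin
    return kept


if __name__ == "__main__":
    # A 12 kb toy chromosome cut into 5 kb windows; the last window is shorter.
    print(make_windows(12_000, 5_000))
    # One SNP retained per 1 kb bin.
    print(thin_snps([10, 500, 1200, 1900, 5000], 1_000))
```

Thinning by fixed bins does not guarantee a minimum distance between retained SNPs (two SNPs on either side of a bin boundary can both be kept); a distance-based rule would be a reasonable alternative.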
All files are in either FASTA or NEXUS format. FASTA is a standard text format for nucleotide sequences, and a FASTA genome file is provided for each Neodiprion species. Using freely available scripts (https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny), the FASTA files can be used to produce window-based and gene-based datasets in NEXUS format. NEXUS is a standard format for character data in phylogenetic analysis, and NEXUS files can be used as input for many different phylogenetic programs.
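For readers who want to see what the FASTA-to-NEXUS conversion involves, here is a minimal Python sketch. It is not the conversion script from the GitHub repository. The function names and the FORMAT line (DATATYPE=DNA MISSING=? GAP=-) are assumptions for illustration, and the example assumes that all sequences for a locus are already aligned to the same length, as they are for files in shared N. lecontei coordinates.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first word of the header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def to_nexus(records):
    """Format aligned sequences as a NEXUS DATA block (returned as a string)."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    nchar = lengths.pop()
    width = max(len(name) for name in records)

    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    # Pad taxon names so that the sequence columns line up.
    lines += [f"  {name.ljust(width)}  {seq}" for name, seq in records.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-taxon locus; taxon names are placeholders.
    example = ">species_A\nACGTAC\n>species_B\nNNGTAC\n"
    print(to_nexus(parse_fasta(example)))
```

Taxon names that contain spaces or punctuation would need quoting in NEXUS; the sketch assumes simple names with no whitespace.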
National Science Foundation, Award: DEB-CAREER-1750946
United States Department of Agriculture, Award: 2016-67014-2475