Data from: Whole genomes reveal evolutionary relationships and mechanisms underlying gene-tree discordance in Neodiprion sawflies
Data files
Jan 06, 2023 version files 5.54 GB
- 100kb_genome.nex (52.54 KB)
- 10kb_genome.nex (481.82 KB)
- 1kb_genome.nex (4.25 MB)
- 50kb_genome.nex (102.94 KB)
- 5kb_genome.nex (929.78 KB)
- abbotii_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- allkb_genome.nex (274.65 MB)
- autumnalis_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- compar_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- dubiosus_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- excitans_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- fabricii_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- hetricki_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- knereri_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- maurus_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- merkeli_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- nigroscutum_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- pinetum_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- pinusrigidae_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- pratti_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- README.md (3.98 KB)
- rugifrons_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- swainei_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- taedae_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- virginianus_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- warreni_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
May 14, 2024 version files 10.52 GB
- 100kb_genome.nex (52.54 KB)
- 10kb_genome.nex (481.82 KB)
- 1kb_genome.nex (4.25 MB)
- 50kb_genome.nex (102.94 KB)
- 50kb.zip (3.86 GB)
- 5kb_genome.nex (929.78 KB)
- abbotii_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- allkb_genome.nex (274.65 MB)
- autumnalis_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- compar_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- dubiosus_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- excitans_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- fabricii_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- filtered_transcripts.zip (676.83 MB)
- GCF_021901455.1_iyNeoLeco1.1_genomic.fa (275.49 MB)
- GCF_021901455.1_iyNeoLeco1.1_genomic.gff (172.75 MB)
- hetricki_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- knereri_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- maurus_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- merkeli_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- nigroscutum_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- pinetum_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- pinusrigidae_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- pratti_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- README.md (4.80 KB)
- rugifrons_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- sCF_genomestats_50kbwindows.csv (2.80 MB)
- swainei_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- taedae_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- virginianus_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
- warreni_iupac_it5.depth_withnucleotide_rd4totop1percentorN.fa (276.66 MB)
Abstract
Rapidly evolving taxa are excellent models for understanding the mechanisms that give rise to biodiversity. However, developing an accurate historical framework for comparative analysis of such lineages remains a challenge due to ubiquitous incomplete lineage sorting and introgression. Here, we use a whole-genome alignment, multiple locus-sampling strategies, and summary-tree and SNP-based species-tree methods to infer a species tree for eastern North American Neodiprion species, a clade of pine-feeding sawflies (Order: Hymenoptera; Family: Diprionidae). We recovered a well-supported species tree that, except for three uncertain relationships, was robust to different strategies for analyzing whole-genome data. Nevertheless, underlying gene-tree discordance was high. To understand this genealogical variation, we used multiple linear regression to model site concordance factors estimated in 50-kb windows as a function of several genomic predictor variables. We found that site concordance factors tended to be higher in regions of the genome with more parsimony-informative sites, fewer singletons, less missing data, lower GC content, more genes, lower recombination rates, and lower D-statistics (less introgression). Together, these results suggest that incomplete lineage sorting, introgression, and genotyping error all shape the genomic landscape of gene-tree discordance in Neodiprion. More generally, our findings demonstrate how combining phylogenomic analysis with knowledge of local genomic features can reveal mechanisms that produce topological heterogeneity across genomes.
README: Title of Dataset:
Data From: Whole Genomes Reveal Evolutionary Relationships and Mechanisms Underlying Gene-Tree Discordance in Neodiprion Sawflies
This project uses whole-genome data to infer evolutionary relationships in pine sawflies in the genus Neodiprion and to understand the causes of gene-tree discordance. The data consist of a new chromosome-level assembly for one species (Neodiprion lecontei) and resequencing data for 18 additional species in the eastern North American Lecontei clade (N. abbotii, N. compar, N. dubiosus, N. excitans, N. fabricii, N. hetricki, N. knereri, N. maurus, N. merkeli, N. nigroscutum, N. pinetum, N. pinusrigidae, N. pratti, N. rugifrons, N. swainei, N. taedae, N. virginianus, and N. warreni) and a western North American outgroup species, Neodiprion autumnalis.
To produce datasets for phylogenetic analysis, pseudo-genomes (reference-based genome assemblies in N. lecontei physical coordinates) were produced for each of the 19 species other than N. lecontei. The N. lecontei genome is available on NCBI (iyNeoLeco1.1 RefSeq GCF_021901455.1) and is also provided here in FASTA format. The 19 pseudo-genomes are provided here in FASTA format, with each species having end-to-end sequences for each of seven chromosomes, all in the same physical coordinate space as the N. lecontei genome (equivalent to a whole-genome alignment).
We then used these genomes to produce phylogenomic datasets of several types:
- a windowed dataset in which the genomes were sliced into non-overlapping windows of 50kb
- a gene-based dataset in which the N. lecontei gene annotation was used to pull out coding regions in all the genomes
- SNP-based datasets in which all bi-allelic SNPs were called across the 20 species (N. lecontei + 19 other Neodiprion), then filtered with various spacing requirements (none = all SNPs, every 1kb, 5kb, 10kb, 50kb, and 100kb).
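Because all genomes share N. lecontei coordinates, the windowed dataset amounts to slicing every species at the same positions. The Python sketch below illustrates that idea only; the authors used bedtools, and the function names and the handling of a shorter final window are assumptions made for illustration.

```python
def window_coords(chrom_length, window_size=50_000):
    """Yield 0-based, half-open (start, end) pairs for non-overlapping windows.

    Assumption: a final window shorter than window_size is kept rather than dropped.
    """
    for start in range(0, chrom_length, window_size):
        yield start, min(start + window_size, chrom_length)


def slice_chromosome(seqs_by_species, window_size=50_000):
    """Cut one chromosome into per-window alignments.

    seqs_by_species maps a species name to its sequence for this chromosome.
    All sequences must have the same length (shared coordinate space).
    """
    lengths = {len(seq) for seq in seqs_by_species.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not in a shared coordinate space")
    (chrom_length,) = lengths
    for start, end in window_coords(chrom_length, window_size):
        yield (start, end), {sp: seq[start:end] for sp, seq in seqs_by_species.items()}
```

For example, a 120-kb chromosome would yield windows starting at 0, 50,000, and 100,000, with the last one 20 kb long.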
The window-based and gene-based datasets produced thousands of nexus files, which are provided as zip archives (50kb.zip and filtered_transcripts.zip). The 6 SNP datasets are provided as 6 nexus files. All scripts used to generate (a) pseudo-reference genomes, (b) windowed dataset, (c) gene dataset, and (d) SNP datasets are provided as four zip archives that each include a README file.
We also summarized genomic variables in each window to explore predictors of site concordance factors (sCF). We then used a multiple regression approach to see which variables predicted discordance with the species tree. A CSV file with sCF for each node in the tree and each predictor variable is provided, as is the R code used to paint chromosomes by sCF and perform multiple regression analyses.
Description of the Data and file structure
There are 4 types of files uploaded to Dryad:
1. FASTA files for each of the 19 "pseudo-reference" Neodiprion genomes, plus the reference Neodiprion lecontei genome (end in ".fa"). We also include a gene annotation file for the reference genome (ends in ".gff").
- For pseudo-reference genomes, the first term in the file name is the species name.
- All files have been filtered in the same way (indicated by the remaining terms in the file names):
- heterozygous sites are coded using IUPAC ambiguity codes
- fifth iteration of the analysis pipeline
- minimum read depth of 4
- unusually high read depth (top 1%) coded as missing data
2. Nexus files for each of 6 SNP datasets (end in ".nex")
- For all datasets, we excluded SNPs that were absent in more than 10% of the species and SNPs that had more than 2 alleles
- The first term in the file name indicates any required spacing between SNPs:
- allkb_genome.nex: all SNPs included (no minimum spacing)
- 1kb_genome.nex: at least 1kb between SNPs
- 5kb_genome.nex: at least 5kb between SNPs
- 10kb_genome.nex: at least 10kb between SNPs
- 50kb_genome.nex: at least 50kb between SNPs
- 100kb_genome.nex: at least 100kb between SNPs
3. Zipped files containing alignments cut into 50kb windows (50kb.zip) and genes (filtered_transcripts.zip). These can be used for downstream analyses in IQ-TREE and ASTRAL.
4. CSV file containing site concordance factors for each branch in a reference species tree and several genomic predictor variables (sCF_genomestats_50kbwindows.csv).
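As a quick sanity check on a downloaded pseudo-reference sequence, one might count heterozygous (IUPAC-coded) and missing sites. The sketch below is illustrative only: the helper names are hypothetical, it considers only the six two-base IUPAC codes, and it treats "N" as the only missing-data symbol, which matches the masking described here but is still an assumption about the files.

```python
# Two-base IUPAC ambiguity codes and the pair of nucleotides each represents.
IUPAC_HET = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def species_from_filename(filename):
    """Return the first underscore-delimited term of a pseudo-reference file name."""
    return filename.split("_", 1)[0]


def summarize_sequence(seq):
    """Count total, heterozygous (two-base IUPAC), and missing (N) sites."""
    seq = seq.upper()
    return {
        "length": len(seq),
        "heterozygous": sum(seq.count(code) for code in IUPAC_HET),
        "missing": seq.count("N"),
    }
```

Note that `species_from_filename` only applies to the pseudo-reference files; the N. lecontei reference file name begins with its RefSeq accession instead.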
Sharing/access Information
Links to other publicly accessible locations of the data:
N. lecontei reference genome: https://www.ncbi.nlm.nih.gov/genome/39861?genome_assembly_id=1780404
Code for generating datasets from pseudo-reference genomes: https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny
Was data derived from another source? No
If yes, list source(s): N/A
Methods
DNA was extracted from field-caught larvae. Then, Illumina libraries were prepared and sequenced on an Illumina NextSeq 500 with PE150 reads, which produced 14-27 million reads per individual.
To obtain a multi-genome alignment, we used a pseudo-reference-based approach, with an annotated, reference-quality N. lecontei genome (iyNeoLeco1.1 RefSeq GCF_021901455.1) serving as the reference. Briefly, we first used bowtie2 v2.4.1 to map reads from each species to the N. lecontei reference genome. To allow for divergence between reads and the N. lecontei reference, we initially allowed a mismatch in the seed and used the “local” mapping option in bowtie2. New variants (excluding indels) were incorporated using samtools v1.10 and bcftools v1.10.2. In a second round of mapping, this process was repeated using the first iteration of the genome for each species as the new reference genome. The third round of mapping removed the seed mismatch. The fourth and fifth iterations required end-to-end mapping. After the fifth iteration, we used a custom script to replace with an “N” any nucleotide that had a read depth less than 4 or an excessively high mapping depth (highest 1% of depths for each species). All bioinformatics commands and scripts can be found on the LinnenLab GitHub page under the Herrig_etal_NeodiprionPhylogeny repository (https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny and Zenodo). This approach produced FASTA files for each species, with genome sequences in N. lecontei coordinates. All FASTA files are provided here.
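The final depth-masking step can be sketched in Python as follows. This is not the authors' custom script (see the GitHub repository for that); the function names are hypothetical, and the way the top 1% cutoff is computed (nearest-rank percentile, masking depths strictly above it) is an assumption.

```python
def high_depth_cutoff(depths, top_percent=1):
    """Nearest-rank depth below which (100 - top_percent)% of sites fall.

    Integer arithmetic avoids floating-point rounding in the rank.
    """
    ranked = sorted(depths)
    rank = -(-len(ranked) * (100 - top_percent) // 100)  # ceiling division
    return ranked[max(rank, 1) - 1]


def mask_by_depth(seq, depths, min_depth=4, top_percent=1):
    """Replace bases with 'N' where depth is below min_depth or above the cutoff.

    Assumption: `depths` holds one per-site read depth for every base in `seq`
    and the cutoff is computed over the same set of sites.
    """
    if len(seq) != len(depths):
        raise ValueError("sequence and depth vector differ in length")
    if not depths:
        return seq
    cutoff = high_depth_cutoff(depths, top_percent)
    return "".join(
        "N" if depth < min_depth or depth > cutoff else base
        for base, depth in zip(seq, depths)
    )
```

With 100 sites whose depths run from 1 to 100, this masks the three sites with depth below 4 and the single site with depth 100.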
We next used the FASTA files to produce additional datasets for analysis. First, we used bedtools v2.30.0 to divide the seven Neodiprion chromosomes into non-overlapping windows of 50 kb. Second, to approximate a dataset of protein-coding genes analogous to an RNAseq or exon-capture phylogenomic dataset, we used gffread v0.11.7 with the -w flag to write FASTA files with spliced exons for each transcript for each species using the NCBI Neodiprion lecontei Annotation Release (iyNeoLeco1.1 RefSeq GCF_021901455.1). Windowed and gene datasets are provided as nexus files that contain individual window/gene alignments. These can also be regenerated using scripts (available on GitHub: https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny and Zenodo) to cut up the genome into desired loci (windows or genes) and convert these to nexus format. Third, we called single nucleotide polymorphisms (SNPs) across the entire genome using SNP-sites v2.5.1. We then filtered the data to exclude SNPs that were absent in more than 10% of species and sites with more than two alleles. In addition to analyzing all SNPs (which likely contain tightly linked sites), we produced additional datasets with one SNP sampled every 1 kb, 5 kb, 10 kb, 50 kb, or 100 kb using SNP-sites, with more sparsely sampled SNPs on par with a dataset that might be generated via RADseq. We transformed each of the six datasets into nexus format. All six SNP nexus files are provided and can be used as input for SVDquartets.
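The SNP filters and spacing requirement described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the function names are hypothetical, heterozygous IUPAC codes are assumed to contribute both of their alleles, and the greedy left-to-right thinning within a chromosome is one plausible way to enforce a minimum spacing.

```python
# Alleles implied by each character; anything else (N, ?, -) counts as missing.
ALLELES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def is_usable_snp(column, max_missing_fraction=0.10):
    """Check one alignment column (one character per species).

    Keeps sites with at most max_missing_fraction missing species
    and exactly two alleles.
    """
    column = column.upper()
    missing = sum(1 for char in column if char not in ALLELES)
    if missing / len(column) > max_missing_fraction:
        return False
    alleles = set()
    for char in column:
        alleles.update(ALLELES.get(char, ""))
    return len(alleles) == 2


def thin_by_spacing(positions, min_spacing):
    """Greedily keep SNP positions at least min_spacing bp apart (one chromosome)."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_spacing:
            kept.append(pos)
    return kept
```

Because the thinning is greedy, which SNPs survive depends on the first position kept; a different starting point or a random draw per interval would give a different but equally valid subset.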
To investigate sources of phylogenetic discordance, we also generated estimates of site concordance factors in 50-kb windows and estimated 7 genomic predictor variables for these same 50-kb windows: number of parsimony-informative sites, number of singletons, proportion of missing data, GC content, D-statistics, gene density, and recombination rate. This dataset is available as a CSV file.
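Four of these predictors can be computed directly from a window alignment. The Python sketch below shows one way to do so under stated assumptions: a parsimony-informative site has at least two nucleotides that each occur in at least two species, a singleton site is variable but not parsimony-informative, only A/C/G/T are counted as called bases, and only "N", "?", and "-" count as missing. These definitions and the function name are illustrative and may differ from those used to build the CSV file.

```python
from collections import Counter


def window_stats(alignment):
    """Summarize one window; `alignment` is a list of equal-length sequences."""
    n_cells = sum(len(seq) for seq in alignment)
    n_missing = n_gc = n_called = 0
    for seq in alignment:
        seq = seq.upper()
        n_missing += sum(seq.count(char) for char in "N?-")
        n_gc += seq.count("G") + seq.count("C")
        n_called += sum(seq.count(base) for base in "ACGT")

    n_informative = n_singleton = 0
    for column in zip(*alignment):
        counts = Counter(base.upper() for base in column if base.upper() in "ACGT")
        if len(counts) < 2:
            continue  # constant (or entirely uncalled) site
        shared_states = sum(1 for count in counts.values() if count >= 2)
        if shared_states >= 2:
            n_informative += 1
        else:
            n_singleton += 1

    return {
        "parsimony_informative": n_informative,
        "singletons": n_singleton,
        "prop_missing": n_missing / n_cells if n_cells else 0.0,
        "gc_content": n_gc / n_called if n_called else 0.0,
    }
```

D-statistics, gene density, and recombination rate require additional inputs (species-level allele patterns, the gene annotation, and a recombination map) and are not covered by this sketch.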
Usage notes
The sequence data are in either FASTA or NEXUS format; the window and gene alignments are NEXUS files inside zip archives. FASTA is a standard text format for nucleotide sequences, and FASTA genome files are provided for each Neodiprion species. Using freely available scripts (https://github.com/LinnenLab/Herrig_etal_NeodiprionPhylogeny), these can be used to produce window-based and gene-based datasets in NEXUS format. NEXUS is a standard format for character data in phylogenetic analysis, and the NEXUS files can be used as input for many different phylogenetic programs.