Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species
Data files
Jul 18, 2021 version files 2.27 GB
- DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz (163.44 MB)
- DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz (7.81 MB)
- DF_p52-varscan_all_bedfiles_SNP.txt.gz (42.15 MB)
- DF_snc_trans.fna.gz (12.23 MB)
- haploid_pipeline_datatable.txt (1.58 KB)
- JP_i101_filtered_concatenated_snps_max-missing_table_biallelic-only_translated.txt.gz (39.61 MB)
- JP_pooled-varscan_all_bedfiles_SNP_translated.txt.gz (1.98 GB)
- JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt.gz (1 MB)
- JP_transcriptome_for_probe.fasta.gz (25.71 MB)
- pooled_individual_pipeline_datatable.txt (14.05 KB)
- README.txt (1.81 KB)
- rna-seq_biosample.txt (1.07 KB)
- rna-seq_sra_doc.txt (1.25 KB)
- testdata_biosample.txt (12.42 KB)
- testdata_sra_doc.txt (20.56 KB)
Abstract
Despite their suitability for studying evolution, many conifer species have large and repetitive giga-genomes (16-31Gbp) that create hurdles to producing high coverage SNP datasets that capture diversity from across the entirety of the genome. Due in part to multiple ancient whole genome duplication events, gene family expansion and subsequent evolution within Pinaceae, false diversity from the misalignment of paralog copies creates further challenges in accurately and reproducibly inferring evolutionary history from sequence data. Here, we leverage the cost-saving benefits of pool-seq and exome-capture to discover SNPs in two conifer species, Douglas-fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae). We show, using minimal baseline filtering, that allele frequencies estimated from pooled individuals show a strong positive correlation with those estimated by sequencing the same population as individuals (r > 0.948), on par with such comparisons made in model organisms. Further, we highlight the utility of haploid megagametophyte tissue for identifying sites that are likely due to misaligned paralogs. Together with additional minor filtering, we show that it is possible to remove many of the loci with large frequency estimate discrepancies between individual and pooled sequencing approaches, improving the correlation further (r > 0.973). Our work addresses bioinformatic challenges in non-model organisms with large and complex genomes, highlights the use of megagametophyte tissue for the identification of paralog sites, and suggests the combination of pool-seq and exome capture to be robust for further evolutionary hypothesis testing in these systems.
These data were collected from natural populations. Exome-capture pool-seq data (20 diploid individuals), individually sequenced data (the same diploid individuals as in the pools), and haploid megagametophyte data from a single individual were sequenced on an Illumina HiSeq4000 instrument at the Centre d'expertise et de services Génome Québec, Montréal, Canada. Sequence data were processed according to bioinformatic best practices.
All code used to analyze these files is also attached.
Each code file is saved both in Jupyter notebook format (.ipynb) and as .html. The HTML version can be used to view a notebook without launching a Jupyter kernel.
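For a quick look at one of the compressed tables without opening the notebooks, a minimal Python sketch along the following lines may help. It assumes the .txt.gz files are tab-delimited text with a single header row, which is not stated here and should be checked against README.txt; the function name preview_table is hypothetical and is not part of the attached code.

```python
import csv
import gzip
from itertools import islice


def preview_table(path, n_rows=5, delimiter="\t"):
    """Return the header and the first n_rows data rows of a gzipped text table.

    Assumption: the file is delimited text with one header row. Confirm the
    actual layout in README.txt before relying on this.
    """
    # Open in text mode so the file is decompressed and decoded on the fly;
    # only the first few rows are read, so large files stay cheap to preview.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


# Example call (the file must be downloaded first):
# header, rows = preview_table("DF_p52-varscan_all_bedfiles_SNP.txt.gz")
```

Because only the first rows are read, this also works as a cheap check on the largest file (1.98 GB) before loading it in full.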