High-density genetic linkage mapping in Sitka spruce advances the integration of genomic resources in conifers
Data files
Jan 19, 2024 version files 148.68 MB
- Integrated_Map_White_Sitka.csv (1.05 MB)
- RADChip_Genotype_Table.txt (49.51 MB)
- RADChip_Linkage_Map.csv (634.07 KB)
- RADChip_Map_Info.csv (1.34 MB)
- RADChip_Pedigree_vertical.csv (12.67 KB)
- RADSeq_Genotype_Table.txt (80.34 MB)
- RADSeq_Linkage_Map.csv (341.76 KB)
- RADSeq_Pedigree_vertical.csv (26.66 KB)
- README.md (17.68 KB)
- SNP_Chip_Data.csv (2.44 MB)
- SNPChip_Genotype_Table.txt (12.81 MB)
- SNPChip_Linkage_Map.csv (146.73 KB)
- SNPChip_Pedigree_vertical.csv (14.76 KB)
Abstract
In species with large and complex genomes such as conifers, dense linkage maps are useful for supporting genome assembly and laying the genomic groundwork at the structural, populational and functional levels. However, most of the 600+ extant conifer species still lack extensive genotyping resources, which hampers the development of high-density linkage maps. In this study, we developed a linkage map relying on 21,570 SNP markers in Sitka spruce (Picea sitchensis [Bong.] Carr.), a long-lived conifer from western North America that is widely planted for productive forestry in the British Isles. We used a single-step mapping approach to efficiently combine RAD-Seq and genotyping array SNP data for 528 individuals from two full-sib families. As expected for spruce taxa, the saturated map contained 12 linkage groups with a total length of 2,142 cM. The positioning of 5,414 unique gene coding sequences allowed us to compare our map with that of other Pinaceae species, which provided evidence for high levels of synteny and gene order conservation in this family. We then developed an integrated map for P. sitchensis and P. glauca based on 27,052 markers and 11,609 gene sequences. Altogether, these two linkage maps, the accompanying catalog of 286,159 SNPs and the genotyping chip developed herein open new perspectives for a variety of fundamental and more applied research objectives, such as the improvement of spruce genome assemblies, or marker-assisted sustainable management of genetic resources in Sitka spruce and related species.
File generated 01-17-2024 by Hayley Tumas
#####GENERAL INFORMATION#####
Title: High-density genetic linkage mapping in Sitka spruce advances the integration of genomic resources in conifers
This is data and code accompanying the publication “High-density genetic linkage mapping in Sitka spruce advances the integration of genomic resources in conifers” in G3 (Tumas et al. 2024, Accepted, DOI pending).
PROJECT:
The project developed a high-density genetic linkage map for Sitka spruce using genotype data for two full sib families from the UK Sitka spruce breeding population. Samples were genotyped using both a SNP array developed for the project and RAD-Seq. These two genotype datasets were used separately to generate genetic maps and together in a single combined map using the software Lep-MAP3. This combined genetic map (RADChip map) was used along with a genetic map for white spruce (Pavy et al 2017) to create an integrated species map. The data include information on the SNP array (SNP Chip Assay) developed for this project via SNP discovery using exome capture, genotype tables for two full sib families and accompanying pedigree files used to develop linkage maps, the final linkage map for Sitka spruce produced by the project, and an integrated linkage map for Sitka and white spruce. All data files are in .csv or .txt format. There is also code for converting genotype tables and pedigree files into Lep-MAP3 files for linkage map development using PLINK (.sh script) and R (.R file). More detailed information on SNP discovery, sample processing, genotyping, and linkage map development can be found in the accompanying publication.
DATE OF COLLECTION: 2017-2020
GEOGRAPHIC LOCATION: United Kingdom
ADDITIONAL DATA OR CODE:
Further code accompanying this data to produce results found in the publication can be found on GitHub at https://github.com/HayleyTumas/SitkaLinkageMap
CONTACT INFORMATION:
If there are any issues, errors, or questions regarding the data or accompanying code and analyses, please contact Hayley Tumas (htumas@gmail.com).
#########################################################################################################
####DATA OVERVIEW####
FILE LIST:
A) SNP_Chip_Data.csv
B) SNPChip_Genotype_Table.txt
C) RADSeq_Genotype_Table.txt
D) RADChip_Genotype_Table.txt
E) SNPChip_Pedigree_vertical.csv
F) RADSeq_Pedigree_vertical.csv
G) RADChip_Pedigree_vertical.csv
H) create_LepMap_file.R (code)
I) SNPtable_to_VCF.sh (code)
J) SNPChip_Linkage_Map.csv
K) RADSeq_Linkage_Map.csv
L) RADChip_Linkage_Map.csv
M) RADChip_Map_Info.csv
N) Integrated_Map_White_Sitka.csv
FILE RELATIONSHIPS:
- Data on the SNP Chip Assay developed: SNP_Chip_Data.csv
- Files and code to convert files for use in Lep-MAP3 to generate 3 linkage maps:
- Tables containing genotype data for offspring and parents from two full-sib families using the SNP Chip assay (SNPChip_Genotype_Table.txt), RAD-Seq (RADSeq_Genotype_Table.txt), and the combined RAD-Chip dataset (RADChip_Genotype_Table.txt).
- Pedigree data for individuals in each dataset in vertical format (SNPChip_Pedigree_vertical.csv, RADSeq_Pedigree_vertical.csv, RADChip_Pedigree_vertical.csv)
- Code to convert genotype tables and pedigree files for use in Lep-MAP3. Genotype tables need to be converted to VCFs. Begin by using [create_LepMap_file.R] to convert the “.txt” files to PLINK “.ped” and “.map” files. Then use [SNPtable_to_VCF.sh] to convert the PLINK files to VCFs. Use [create_LepMap_file.R] to convert the vertical pedigree files to files readable by Lep-MAP3. Proceed with these files in Lep-MAP3 using the accompanying GitHub code.
- Final linkage maps produced using:
- the SNP Chip dataset (SNPChip_Linkage_Map.csv)
- the RAD-Seq dataset (RADSeq_Linkage_Map.csv)
- the combined RAD-Chip dataset (RADChip_Linkage_Map.csv)
- A secondary file (RADChip_Map_Info.csv) provides more information about the RAD_Chip map with columns giving synteny to three other conifer species (white spruce, Norway spruce, and limber pine) as per the accompanying publication
- Final integrated map of Sitka spruce and white spruce linkage maps (Integrated_Map_White_Sitka.csv). Information on map integration is in the accompanying publication and code for integration with the “LPMerge” R package can be found in the GitHub repository.
Code to analyze this data can be found at: https://github.com/HayleyTumas/SitkaLinkageMap
#########################################################################################################
DATA SPECIFIC TO: SNP_Chip_Data.csv
-
Description: List of SNPs on SNP Chip array developed for project
-
Number of Variables: 9
-
Number of columns: 9
-
Number of rows: 12911
-
Variables:
*SNP_ID: Unique SNP marker identifier
*Picea_glauca_GCAT_SequenceID_v3.3_Rigaultetal2011: ID for white spruce gene catalog sequence in which SNP was discovered as published in Rigault et al 2011
*Mapped_in_Picea_glauca_Pavy2017: whether or not (Y:Yes/N:No) this gene catalog sequence was previously mapped in white spruce as published in Pavy et al 2017
*Sequence: sequence surrounding SNP
*Illumina_ID: unique identifier assigned by Illumina in generating SNP array
*Nb.Beads_Needed: number of beads needed to synthesize SNP on the array
*Illumina_Strand: whether the Illumina strand is top (TOP) or bottom (BOT)
*Polymorphism: SNP polymorphism that varies between samples
-
Missing data code: None
#########################################################################################################
DATA SPECIFIC TO: SNPChip_Genotype_Table.txt
-
Description: SNP array genotype data for two full sib Sitka spruce families with SNPs in columns and individuals in rows. Note that first two rows contain SNP chromosome and position information for use in LEPMap3
-
Number of Variables: 4
-
Number of columns: 5534
-
Number of rows: 617
-
Variables:
*Columns: SNP IDs and accompanying heterozygote genotypes separated by “/”
*Rows: Individual samples from full sib families
*CHR: Dummy chromosome (1) data for all SNPs for use in LepMap3
*POS: Dummy position (1:5533) data for all SNPs for use in LepMap3
-
Missing data code: NA
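
The layout described above (a header of SNP IDs, then the CHR and POS rows, then one row per sample) can be separated into marker metadata and per-sample genotypes along the lines of the following minimal Python sketch. It is not the authors' create_LepMap_file.R. The tab delimiter, the row label sitting in the first column, and the function name `split_genotype_table` are assumptions that should be checked against the actual file. The demo table is made up.

```python
import csv
import io


def split_genotype_table(handle, delimiter="\t"):
    """Split a genotype table into marker metadata and per-sample genotypes.

    Assumed layout (verify against the real file): the header row holds a
    label column followed by SNP IDs, two rows are labelled CHR and POS,
    and every other row is one sample with genotypes such as "A/G" or NA.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    snp_ids = header[1:]
    meta = {}     # "CHR" / "POS" -> list of values, one per SNP
    samples = {}  # sample ID -> list of (allele1, allele2) tuples or None
    for row in reader:
        if not row:
            continue  # skip blank lines
        label, values = row[0], row[1:]
        if label in ("CHR", "POS"):
            meta[label] = values
        else:
            samples[label] = [
                None if g == "NA" else tuple(g.split("/")) for g in values
            ]
    markers = list(zip(snp_ids, meta["CHR"], meta["POS"]))
    return markers, samples


# Made-up example data, not taken from the dataset.
demo_genotypes = (
    "ID\tsnp1\tsnp2\n"
    "CHR\t1\t1\n"
    "POS\t1\t2\n"
    "ind1\tA/G\tNA\n"
    "ind2\tA/A\tC/T\n"
)
markers, samples = split_genotype_table(io.StringIO(demo_genotypes))
print(markers)
print(samples["ind1"])
```

The marker tuples carry the information a PLINK “.map” file needs, and the allele pairs are what a “.ped” row is built from. The conversion itself should follow the provided R script.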
#########################################################################################################
DATA SPECIFIC TO: RADSeq_Genotype_Table.txt
-
Description: RAD-Seq genotype data for two full sib Sitka spruce families with SNPs in columns and individuals in rows. Note that first two rows contain SNP chromosome and position information for use in LEPMap3
-
Number of Variables: 4
-
Number of columns: 20270
-
Number of rows: 1113
-
Variables:
*Columns: SNP IDs as column header and accompanying heterozygote genotypes separated by “/” for each sample
*Rows: Individual samples from full sib families
*CHR: Chromosome number for use in LepMap3 pulled from data generated in RAD-Seq pipeline
*POS: Position number for use in LepMap3 pulled from data generated in RAD-Seq pipeline; when combined with CHR it is unique to each SNP
-
Missing data code: NA
#########################################################################################################
DATA SPECIFIC TO: RADChip_Genotype_Table.txt
-
Description: Combined SNP Chip array and RAD-Seq genotype data for two full sib Sitka spruce families with SNPs in columns and individuals in rows. Note that first two rows contain SNP chromosome and position information for use in LEPMap3
-
Number of Variables: 4
-
Number of columns: 25803
-
Number of rows: 530
-
Variables:
*Columns: SNP IDs as column header and accompanying heterozygote genotypes separated by “/” for each sample
*Rows: Individual samples from full sib families that were genotyped using both SNP Chip and RAD-Seq
*CHR: Chromosome number for use in LepMap3 either dummy generated or pulled from data generated in RAD-Seq pipeline
*POS: Position number for use in LepMap3 either dummy generated or pulled from data generated in RAD-Seq pipeline; when combined with CHR it is unique to each SNP
-
Missing data code: NA
#########################################################################################################
DATA SPECIFIC TO: SNPChip_Pedigree_vertical.csv
-
Description: Pedigree information for all samples that appear in SNPChip_Genotype_Table.txt that is needed for LEPMap3. NOTE: must be transformed to horizontal format for use in LEPMap3
-
Number of Variables: 6
-
Number of columns: 6
-
Number of rows: 615
-
Variables:
*FAM: full sib family identifier (either LM1 for Family 1 or LM2 for Family 2)
*ID: individual sample ID, matches genotype table
*P1: Parent 1, father
*P2: Parent 2, mother
*Sex: sex of individual (1:male, 2:female), only assigned for parents as Sitka spruce is monoecious
*Phenotype: required by LepMap3, no phenotypes for this data so set to 1
-
Missing data code: none
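
The vertical-to-horizontal transformation mentioned in the note above amounts to a transpose: each of the six variables becomes one line, with one column per individual. The Python sketch below illustrates the idea. It is not the authors' create_LepMap_file.R, and it assumes that the CSV header uses the variable names listed above and that the Lep-MAP3 pedigree file is tab-delimited with two placeholder columns (written here as CHR and POS) before the per-individual values. Both assumptions should be verified against the Lep-MAP3 documentation. The function name `pedigree_to_horizontal` and the demo rows are hypothetical.

```python
import csv
import io

# Assumed header names, taken from the variable list in this README.
FIELDS = ["FAM", "ID", "P1", "P2", "Sex", "Phenotype"]


def pedigree_to_horizontal(handle):
    """Transpose a vertical pedigree (one row per sample) into six lines."""
    rows = list(csv.DictReader(handle))
    lines = []
    for field in FIELDS:
        # Two placeholder columns precede the per-individual values
        # (assumed Lep-MAP3 convention; check the tool's documentation).
        lines.append("\t".join(["CHR", "POS"] + [row[field] for row in rows]))
    return "\n".join(lines) + "\n"


# Made-up example data, not taken from the dataset.
demo_pedigree = (
    "FAM,ID,P1,P2,Sex,Phenotype\n"
    "LM1,dad,0,0,1,1\n"
    "LM1,mum,0,0,2,1\n"
    "LM1,kid1,dad,mum,0,1\n"
)
horizontal = pedigree_to_horizontal(io.StringIO(demo_pedigree))
print(horizontal, end="")
```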
#########################################################################################################
DATA SPECIFIC TO: RADSeq_Pedigree_vertical.csv
-
Description: Pedigree information for all samples that appear in RADSeq_Genotype_Table.txt that is needed for LEPMap3. NOTE: must be transformed to horizontal format for use in LEPMap3
-
Number of Variables: 6
-
Number of columns: 6
-
Number of rows: 1111
-
Variables:
*FAM: full sib family identifier (either LM1 for Family 1 or LM2 for Family 2)
*ID: individual sample ID, matches genotype table
*P1: Parent 1, father
*P2: Parent 2, mother
*Sex: sex of individual (1:male, 2:female), only assigned for parents as Sitka spruce is monoecious
*Phenotype: required by LepMap3, no phenotypes for this data so set to 1
-
Missing data code: none
#########################################################################################################
DATA SPECIFIC TO: RADChip_Pedigree_vertical.csv
-
Description: Pedigree information for all samples that appear in RADChip_Genotype_Table.txt that is needed for LEPMap3. NOTE: must be transformed to horizontal format for use in LEPMap3
-
Number of Variables: 6
-
Number of columns: 6
-
Number of rows: 528
-
Variables:
*FAM: full sib family identifier (either LM1 for Family 1 or LM2 for Family 2)
*ID: individual sample ID, matches genotype table
*P1: Parent 1, father
*P2: Parent 2, mother
*Sex: sex of individual (1:male, 2:female), only assigned for parents as Sitka spruce is monoecious
*Phenotype: required by LepMap3, no phenotypes for this data so set to 1
-
Missing data code: none
#########################################################################################################
DATA SPECIFIC TO: SNPChip_Linkage_Map.csv
-
Description: Final linkage map produced using both full-sib families and only SNP array genotypes (only places SNP Chip markers). Only markers that were assigned a position to the linkage map in LepMap3 appear in the file (i.e. number of assigned markers is less than those in genotype table, unassigned markers are excluded here)
-
Number of Variables: 3
-
Number of columns: 3
-
Number of rows: 5064
-
Variables:
*SNP_ID: unique identifier of SNP marker positioned on linkage map
*Linkage_Group: number of the chromosome that the marker was assigned to in linkage map
*Position_cM: position on chromosome (in centimorgans) that the marker was assigned in linkage map
-
Missing data code: none
#########################################################################################################
DATA SPECIFIC TO: RADSeq_Linkage_Map.csv
-
Description: Final linkage map produced using both full-sib families and only RAD-Seq genotypes (only places RAD-Seq markers). Only markers that were assigned a position to the linkage map in LepMap3 appear in the file (i.e. number of assigned markers is less than those in genotype table, unassigned markers are excluded here)
-
Number of Variables: 3
-
Number of columns: 3
-
Number of rows: 15041
-
Variables:
*SNP_ID: unique identifier of SNP marker positioned on linkage map
*Linkage_Group: number of the chromosome that the marker was assigned to in linkage map
*Position_cM: position on chromosome (in centimorgans) that the marker was assigned in linkage map
-
Missing data code: none
#########################################################################################################
DATA SPECIFIC TO: RADChip_Linkage_Map.csv
-
Description: Final linkage map produced using both full-sib families and both SNP Chip array and RAD-Seq genotypes. Only markers that were assigned a position to the linkage map in LepMap3 appear in the file (i.e. number of assigned markers is less than those in genotype table, unassigned markers are excluded here)
-
Number of Variables: 4
-
Number of columns: 4
-
Number of rows: 21570
-
Variables:
*SNP_ID: unique identifier of SNP marker positioned on linkage map
*Linkage_Group: number of the chromosome that the marker was assigned to in linkage map
*Position_cM: position on chromosome (in centimorgans) that the marker was assigned in linkage map
*Picea_glauca_GCAT_SequenceID_v3.3_Rigaultetal2011: identifier (based on Rigault et al 2011 publication) of white spruce gene catalog sequence that contains the mapped SNP. Sequence information for SNP Chip markers determined during SNP discovery, information for RAD-Seq markers determined through BLAST analysis
-
Missing data code: none
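
As a quick sanity check on a linkage map file, the number of markers and the approximate length of each linkage group can be tabulated from the three columns described above. The Python sketch below is illustrative only: it assumes the CSV header matches the variable names in this README, it approximates each group's length by its largest Position_cM value (which presumes positions start at or near 0), and the function name `summarize_map` and the demo rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize_map(handle):
    """Return {linkage_group: (marker_count, max_position_cM)}."""
    counts = defaultdict(int)
    lengths = defaultdict(float)
    for row in csv.DictReader(handle):
        lg = row["Linkage_Group"]
        counts[lg] += 1
        # Length is approximated by the largest position seen in the group.
        lengths[lg] = max(lengths[lg], float(row["Position_cM"]))
    return {lg: (counts[lg], lengths[lg]) for lg in counts}


# Made-up example data, not taken from the dataset.
demo_map = (
    "SNP_ID,Linkage_Group,Position_cM\n"
    "snpA,1,0.0\n"
    "snpB,1,12.5\n"
    "snpC,2,3.25\n"
)
summary = summarize_map(io.StringIO(demo_map))
total_length = sum(length for _, length in summary.values())
for lg, (n_markers, length) in sorted(summary.items()):
    print(lg, n_markers, length)
print("total length (cM):", total_length)
```

Run on the real file, the number of groups and the summed length can be compared with the figures reported in the abstract.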
#########################################################################################################
DATA SPECIFIC TO: RADChip_Map_Info.csv
-
Description: Further data and information for final linkage map produced using both full-sib families and both SNP Chip array and RAD-Seq genotypes.
-
Number of Variables: 13
-
Number of columns: 13
-
Number of rows: 21570
-
Variables:
*SNP_ID: unique identifier of SNP marker positioned on linkage map
*Linkage_Group: number of the chromosome that the marker was assigned to in the Sitka spruce linkage map
*Position_cM: position on chromosome (in centimorgans) that the marker was assigned in the Sitka spruce linkage map
*Picea_glauca_GCAT_SequenceID_v3.3_Rigaultetal2011: identifier (based on Rigault et al 2011 publication) of white spruce gene catalog sequence that contains the mapped SNP. Sequence information for SNP Chip markers determined during SNP discovery, information for RAD-Seq markers determined through BLAST analysis
*SNPID_Piceaglauca: ID of corresponding SNP mapped in white spruce (published in Pavy et al 2017)
*Linkage_Group_Pglauca: number of the chromosome corresponding marker was assigned in the white spruce map
*Position_Pglauca: position on chromosome (in centimorgans) marker was assigned in the white spruce map
*SNPID_Piceaabies: ID of corresponding marker mapped in Norway spruce (published in Bernhardsson et al 2019)
*Linkage_group_Pabies: number of the chromosome corresponding marker was assigned in the Norway spruce map
*Position_Pabies: position on chromosome (in centimorgans) corresponding marker was assigned in the Norway spruce map
*TranscriptID_Pinusflexilis: ID of the corresponding transcriptome sequence mapped in limber pine (published in Liu et al 2019)
*Linkage_group_Plexilis: number of the chromosome corresponding transcriptome sequence was assigned in the limber pine map
*Position_Pflexilis: position on chromosome (in centimorgans) corresponding transcriptome was assigned in the limber pine map
-
Missing data code: none
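
The synteny columns lend themselves to a simple cross-tabulation of Sitka spruce linkage groups against those of another species. The Python sketch below counts markers per pair of linkage groups. It assumes the CSV header matches the variable names above and, because this README lists no missing data code for the file, it guesses that markers without a counterpart have an empty or NA cell. That guess, the function name `lg_crosstab`, and the demo rows are all assumptions to verify against the actual file.

```python
import csv
import io
from collections import Counter


def lg_crosstab(handle, other_col="Linkage_Group_Pglauca"):
    """Count markers by (Sitka linkage group, other-species linkage group)."""
    table = Counter()
    for row in csv.DictReader(handle):
        other = (row.get(other_col) or "").strip()
        if other in ("", "NA"):  # assumed encodings for "no counterpart"
            continue
        table[(row["Linkage_Group"], other)] += 1
    return table


# Made-up example data, not taken from the dataset.
demo_info = (
    "SNP_ID,Linkage_Group,Position_cM,Linkage_Group_Pglauca\n"
    "snpA,1,0.0,1\n"
    "snpB,1,12.5,1\n"
    "snpC,2,3.25,\n"
    "snpD,2,8.0,5\n"
)
crosstab = lg_crosstab(io.StringIO(demo_info))
for pair, n_markers in sorted(crosstab.items()):
    print(pair, n_markers)
```

When synteny is high, most counts for each Sitka linkage group should fall in a single linkage group of the other species.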
#########################################################################################################
DATA SPECIFIC TO: Integrated_Map_White_Sitka.csv
-
Description: Final integrated species linkage map that combines the linkage maps developed for white spruce and Sitka spruce, generated using the R package LPMerge from the RADChip map developed in this project and the white spruce map developed in Pavy et al 2017. Note that a marker was present in a given species map if a position is given in that species' position column.
-
Number of Variables: 7
-
Number of columns: 7
-
Number of rows: 27052
-
Variables:
*Marker_ID: unique identifier of SNP marker positioned on the integrated linkage map
*Consensus_Linkage_Group: number of the chromosome that the marker was assigned to in the integrated species map
*Consensus_Position_cm: position on chromosome (in centimorgans) that the marker was assigned in the integrated species map
*Position_Sitka_Map: position on chromosome (in centimorgans) that the marker was assigned in the Sitka spruce map. If the marker was not mapped in the Sitka spruce map then NA is given
*Position_White_Map_Pavyetal2017: position on chromosome (in centimorgans) that the marker was assigned in the white spruce map. If the marker was not mapped in the white spruce map then NA is given
*Position_Overlap_Marker_Map: position on chromosome (in centimorgans) that the marker was assigned in a map generated in LPMerge from only markers found in both species. If the marker was not mapped in both species then NA is given.
*Best_LPMerge_Interval: LPMerge generates multiple possible maps for each chromosome with an associated maximum likelihood. Each generated map is output in an interval. This gives the interval of the selected chromosome map for each chromosome based on maximum likelihood and similarity to each species map.
-
Missing data code: NA
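
Because NA in the two species position columns marks a marker that was not mapped in that species, the integrated map can be partitioned into Sitka-only, white-spruce-only, and shared markers. The Python sketch below shows one way to do this. It assumes the CSV header matches the variable names above and that missing values are written literally as NA. The function name `classify_markers` and the demo rows are hypothetical.

```python
import csv
import io
from collections import Counter


def classify_markers(handle):
    """Count markers by which species map(s) they were positioned in."""
    counts = Counter()
    for row in csv.DictReader(handle):
        in_sitka = row["Position_Sitka_Map"] != "NA"
        in_white = row["Position_White_Map_Pavyetal2017"] != "NA"
        if in_sitka and in_white:
            counts["both"] += 1
        elif in_sitka:
            counts["sitka_only"] += 1
        elif in_white:
            counts["white_only"] += 1
        else:
            counts["neither"] += 1  # unexpected; worth inspecting if nonzero
    return counts


# Made-up example data, not taken from the dataset.
demo_integrated = (
    "Marker_ID,Position_Sitka_Map,Position_White_Map_Pavyetal2017\n"
    "m1,10.0,11.2\n"
    "m2,15.5,NA\n"
    "m3,NA,20.1\n"
    "m4,30.0,NA\n"
)
marker_classes = classify_markers(io.StringIO(demo_integrated))
print(dict(marker_classes))
```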
The data included in this dataset are genotypic data for two full-sib families of Sitka spruce (Picea sitchensis) in the United Kingdom and the resulting linkage maps for the species. Samples for DNA extraction and genotyping were collected from two full-sib genetic field trials as described in the accompanying publication. A SNP Chip array was developed for this work using exome capture. A subset of the samples had been genotyped using RAD-Seq in a previous project (Fuentes-Utrilla et al 2017). The dataset includes information on the SNP array developed for the project and genotype data that have been filtered for missingness and minor allele frequency. Final results are in the form of linkage maps stored in csv files. Further information on collection methods and processing is detailed in the accompanying manuscript and scripts for data processing are available on GitHub (https://github.com/HayleyTumas/SitkaLinkageMap).
All files can be opened using open access, freely available software. All tabular data are CSV files, except the larger genotype tables, which are text files. Code files are stored either as bash scripts, which can be opened in any text editor, or as .R files, which can be opened using the freely available R software.