Integrating SNP, RNASeq, and phenotypic data from sorghum association and diversity panels for comprehensive genomic analysis
Data files
May 27, 2026 version files 1.19 GB
- 1.SAP_2020_Field_Phenotype.zip (97.56 KB)
- 2.SDP_SAP_f2021_field_phenotype.zip (104.18 KB)
- 3.SAP_2022_field_phenotype_Alabama_Location.zip (20.90 KB)
- 4A.SAP_2023_field_phenotype.csv.gz (33.54 KB)
- 4B.SAP_2023_GH_phenotype.csv.gz (18.05 KB)
- 5.SAP_2024_field_Phenotype.csv.gz (16.04 KB)
- A1.SAP_SNPs_v5.vcf.gz (993.70 MB)
- A2.SAP_InDel_v5.vcf.gz (152.44 MB)
- B.SDP_SAP_815_maf001_het10_imputed.recode.vcf.gz (40.27 MB)
- README.md (16.88 KB)
Abstract
Sorghum bicolor plays a critical role in both agricultural productivity and food security around the globe. The crop has been grown in Africa and Asia for thousands of years and is now produced across six continents. Multiple diverse panels of sorghum lines have been assembled for use in population genetics and quantitative genetics research. One of the most widely used is the Sorghum Association Panel (SAP), which consists of approximately 400 sorghum lines from around the globe carrying introgressions of flowering time and dwarfing genes that adapt them to temperate climates. One of the largest is the SbDiv panel, which consists of >700 sorghum lines from around the globe similarly adapted to grow in temperate climates. This data repository provides genetic marker data based on whole genome resequencing for the SAP aligned to the BTx623 v5 reference genome, genetic marker data called from RNA-seq for the SbDiv panel aligned to the BTx623 v5 reference genome, transcript abundance data for a large majority of both panels measured in mature leaf tissue, and a set of phenotypes scored for these populations across high and low nitrogen treatments in five field experiments conducted between 2020 and 2024.
https://doi.org/10.5061/dryad.3j9kd51w9
Integrating SNP, RNASeq, and Phenotypic data from sorghum association and diversity panels for comprehensive genomic analysis
Sai Subash M.V.S1,2, Harshitha Mangal1,2, Nikee Shrestha1,2, Gen Xu1,2, James Schnable1,2, Jinliang Yang1,2
1 Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, Nebraska, USA.
2 Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, Nebraska, USA.
Description of the data and file structure
Genetic Marker Data
SAP Genetic Markers (Resequencing)
The two VCF files (A1.SAP_SNPs_v5.vcf.gz and A2.SAP_InDel_v5.vcf.gz) contain a new set of 25.6 million genetic markers (22.6 million SNPs and 3 million InDels) discovered and scored relative to the sorghum BTx623 v5 reference genome (McCormick et al. 2018). These genomic variants were called from the resequencing data for 400 sorghum genomes originally published in Boatwright et al. 2022. The raw sequencing data were downloaded from the NCBI SRA and trimmed with fastp (Chen et al. 2018) to remove low-quality reads and sequence regions as well as adapter sequences. The quality-trimmed reads were aligned to the updated sorghum reference genome BTx623 v5 using BWA-MEM (Li 2013). Next, GATK (Genome Analysis Toolkit) was used to perform variant calling, resulting in 28.5 million SNPs and 6.6 million InDels. The raw genetic marker file was filtered to remove variants with a minor allele frequency less than 0.001 (0.1%), a missing rate greater than 0.3 (30%), or a frequency of heterozygous genotype calls greater than 0.1 (10%), resulting in the final set of 25.6 million genomic variants.
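The three filtering thresholds described above can be illustrated with a short Python sketch. This is not the pipeline used to build the files (the README does not name the filtering tool for the resequencing set); it is a minimal illustration that assumes biallelic sites, diploid genotype calls, and a heterozygosity rate computed over non-missing calls. The function names are hypothetical.

```python
def genotype_stats(genotypes):
    """Return (maf, missing_rate, het_rate) for one biallelic site.

    `genotypes` is a list of VCF sample fields such as "0/0", "0|1" or
    "./.:0,0"; only the GT part before the first ":" is used.
    """
    alt = het = missing = 0
    for field in genotypes:
        alleles = field.split(":")[0].replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            missing += 1  # treat uncalled or non-diploid calls as missing
            continue
        n_alt = sum(a != "0" for a in alleles)
        alt += n_alt
        if n_alt == 1:
            het += 1
    total = len(genotypes)
    called = total - missing
    if called == 0:
        return 0.0, 1.0, 0.0
    alt_freq = alt / (2 * called)
    maf = min(alt_freq, 1 - alt_freq)
    # Assumption: heterozygosity is measured among called genotypes only.
    return maf, missing / total, het / called


def passes_filters(genotypes, min_maf=0.001, max_missing=0.3, max_het=0.1):
    """Keep a site unless it meets any of the three removal criteria."""
    maf, missing_rate, het_rate = genotype_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing and het_rate <= max_het
```

Passing `min_maf=0.01` and dropping the missing-rate check would approximate the criteria stated for the RNA-seq marker set instead.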
SbDiv Genetic Markers (RNASeq):
This VCF file (B.SDP_SAP_815_maf001_het10_imputed.recode.vcf.gz) contains a set of 277,724 SNP markers and 68,498 InDels scored relative to the BTx623 v5 sorghum reference genome. Markers were called using RNA-seq data generated from mature leaf tissue of 815 sorghum lines. RNA-seq reads were aligned to the BTx623 v5 sorghum reference genome using STAR (Dobin et al. 2013). Markers with a minor allele frequency of less than 0.01 or a frequency of heterozygous genotype calls greater than 10% were removed using bcftools (Danecek et al. 2021).
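Because this file mixes SNPs and InDels, users may want to tally or separate the two classes. The sketch below counts them from the REF and ALT columns of VCF records. The classification rule is an assumption (all alleles one base long means SNP; any ALT allele differing in length from REF means InDel), and the function names are hypothetical.

```python
import gzip


def classify_variant(ref, alt):
    """Classify one record from its REF and ALT columns."""
    alts = alt.split(",")
    alleles = [ref] + alts
    # Symbolic, spanning-deletion, or empty alleles are left unclassified.
    if any(a in ("*", ".") or a.startswith("<") for a in alleles):
        return "other"
    if all(len(a) == 1 for a in alleles):
        return "SNP"
    if any(len(a) != len(ref) for a in alts):
        return "InDel"
    return "other"  # e.g. multi-nucleotide substitutions


def count_variant_types(lines):
    """Count variant classes over an iterable of VCF text lines."""
    counts = {"SNP": 0, "InDel": 0, "other": 0}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[classify_variant(fields[3], fields[4])] += 1
    return counts


def count_variant_types_in_file(path):
    """Stream a gzip-compressed VCF without loading it into memory."""
    with gzip.open(path, "rt") as handle:
        return count_variant_types(handle)
```

Streaming line by line keeps memory use flat, which matters more for the larger resequencing files than for this one.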
Phenotypic Dataset:
Year 2020
This spreadsheet (1.SAP_2020_Field_Phenotype.zip) includes plot-level phenotype data collected from a field experiment conducted in Lincoln, Nebraska, in 2020 with 390 sorghum genotypes entered in a three-replicate x two-treatment (high (optimal) nitrogen and low nitrogen) design. The majority of the first two blocks were lost during the field season, so most phenotypes are available only for two blocks each of the high and low nitrogen treatments. More details on the field design and layout, as well as the protocols used to measure different phenotypes, are provided in Grzybowski et al. 2022.
Year 2021
This spreadsheet (2.SDP_SAP_f2021_field_phenotype.zip) includes plot-level phenotype data collected from a field experiment conducted in Lincoln, Nebraska, in 2021. The experiment included two replicates of the sorghum association panel (approximately 390 entries) grown under low nitrogen conditions and two replicates of the sorghum association panel plus the sorghum diversity panel (approximately 850 entries) grown under high (optimal) nitrogen conditions. More details on the field design and layout, as well as the protocols used to collect data, are provided in Shrestha et al. 2025 and Mangal et al. 2025.
Year 2022
This file (3.SAP_2022_field_phenotype_Alabama_Location.zip) contains measurements of flowering time collected from the Sorghum Association Panel at AAMU near Huntsville, Alabama, in 2022. Approximately 390 entries were grown in a total of six blocks, three under low nitrogen and three under high nitrogen. See Mangal et al. 2025 for more details.
Year 2023
This file (4B.SAP_2023_GH_phenotype.csv.gz) contains seedling phenotypic data collected from a subset of genotypes (n = 340) in the sorghum association panel. The seedlings were grown in a greenhouse under high-nitrogen and low-nitrogen treatment conditions. Data were collected 21 days after sowing in 2023. The traits collected include seedling height, leaf count, seedling fresh weight, and seedling dry weight.
This dataset (4A.SAP_2023_field_phenotype.csv.gz) contains phenotype data collected from the sorghum association panel. The field experiment was conducted in Lincoln, Nebraska, in 2023, with plants grown under high (optimal) nitrogen and low nitrogen conditions, each with two replicates. The traits collected include panicle number, panicle weight, rachis length and rachis diameter, days to bloom, and plant height.
Year 2024
This dataset (5.SAP_2024_field_Phenotype.csv.gz) contains phenotype data collected from the sorghum association panel. The field experiment was conducted in Lincoln, Nebraska, in 2024, with plants grown under high (optimal) nitrogen and low nitrogen conditions, each with one replicate. The traits collected are node number, plant height, chlorophyll content, and senescence percentage.
File list
A1.SAP_SNPs_v5.vcf.gz
Whole-genome resequencing SNP markers for the Sorghum Association Panel (SAP), aligned to the BTx623 v5 reference genome.
A2.SAP_InDel_v5.vcf.gz
Whole-genome resequencing InDel markers for the Sorghum Association Panel (SAP), aligned to the BTx623 v5 reference genome.
B.SDP_SAP_815_maf001_het10_imputed.recode.vcf.gz
RNA-seq derived marker dataset for the Sorghum Diversity Panel (SDP) and SAP lines, aligned to the BTx623 v5 reference genome.
1.SAP_2020_Field_Phenotype.zip
Plot-level field phenotype data collected in Lincoln, Nebraska, in 2020 for the Sorghum Association Panel under high- and low-nitrogen treatments.
2.SDP_SAP_f2021_field_phenotype.zip
Plot-level field phenotype data collected in Lincoln, Nebraska, in 2021 for SAP and SDP entries under contrasting nitrogen conditions.
3.SAP_2022_field_phenotype_Alabama_Location.zip
Flowering-time phenotype data collected in 2022 at the Alabama location for the Sorghum Association Panel under high- and low-nitrogen treatments.
4A.SAP_2023_field_phenotype.csv.gz
Field phenotype data collected in Lincoln, Nebraska, in 2023 for the Sorghum Association Panel under high- and low-nitrogen treatments.
4B.SAP_2023_GH_phenotype.csv.gz
Greenhouse seedling phenotype data collected in 2023 for a subset of Sorghum Association Panel genotypes under high- and low-nitrogen treatments.
5.SAP_2024_field_Phenotype.csv.gz
Field phenotype data collected in Lincoln, Nebraska, in 2024 for the Sorghum Association Panel under high- and low-nitrogen treatments.
Variable descriptions for tabular files
General categorical legends
- HN = high nitrogen treatment
- LN = low nitrogen treatment
- NR = nitrogen response, calculated from high- and low-nitrogen values
- Plot / PlotID / UID / plot_id = unique identifier for an experimental plot or observation
- Geno / Genotype / Taxa / Name / SorghumName = sorghum accession or line identifier, depending on the dataset
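Since the plot and genotype identifiers carry different column names in different files, combining years usually starts with renaming them. The sketch below maps the aliases listed above onto two canonical names, `plot_id` and `genotype`; those target names and the function name are hypothetical choices, and matching is case-insensitive so that lower-case headers such as `geno` are also caught.

```python
# Aliases taken from the legend above; canonical names are illustrative.
ALIASES = {
    "plot_id": ("plot", "plotid", "uid", "plot_id"),
    "genotype": ("geno", "genotype", "taxa", "name", "sorghumname"),
}
LOOKUP = {alias: canon for canon, names in ALIASES.items() for alias in names}


def harmonize_header(header):
    """Rename known identifier columns; leave every other column unchanged.

    If a file contains two aliases of the same identifier, only the first
    is renamed so that no duplicate column names are produced.
    """
    seen = set()
    renamed = []
    for column in header:
        canon = LOOKUP.get(column.strip().lower())
        if canon is not None and canon not in seen:
            seen.add(canon)
            renamed.append(canon)
        else:
            renamed.append(column)
    return renamed
```

Whether the identifier values themselves match across files (for example PI numbers versus line names) still has to be checked per dataset; this sketch only aligns the column names.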
1.SAP_2020_Field_Phenotype.zip
This file contains plot-level phenotype data collected from the 2020 field experiment in Lincoln, Nebraska, for the Sorghum Association Panel under contrasting nitrogen treatments.
Variables:
- PlotID: unique identifier for each field plot
- SorghumAccession: accession identifier for the sorghum line
- SorghumName: sorghum line name
- SNPDataID: identifier linking the line to genotype data
- Row: field row position of the plot
- Column: field column position of the plot
- Block: field block number
- Treatment: nitrogen treatment; LowNitrogen = low nitrogen, SufficientNitrogen = high/optimal nitrogen
- DaysToBloom: number of days from planting to flowering/bloom
- MedianLeafAngle: median leaf angle measurement
- LeafAngleSDV: standard deviation of leaf angle measurements
- PoorStand?: stand quality indicator; Y = poor stand, N = acceptable stand
- PaniclesPerPlot: number of panicles observed in the plot
- PanicleGrainWeight: grain weight from harvested panicles
- EstimatedPlotYield: estimated grain yield for the plot
- FlagLeafLength: flag leaf length
- FlagLeafWidth: flag leaf width
- ExtantLeafNumber: number of leaves present at scoring
- PlantHeight: plant height
- ThirdLeafLength: third leaf length
- ThirdLeafWidth: third leaf width
- TillersPerPlant: number of tillers per plant
- StemDiameterLower: lower stem diameter
- StemDiameterUpper: upper stem diameter
- RachisLength: rachis length
- RachisDiameterLower: lower rachis diameter
- RachisDiameterUpper: upper rachis diameter
- PrimaryBranchNo: number of primary branches
- BranchInternodeLength: branch internode length
- MoisturePCT: grain moisture percentage
- ProteinPCT: grain protein percentage
- OilPCT: grain oil percentage
- AshPCT: grain ash percentage
- StarchPCT: grain starch percentage
- KernelColor: kernel color category
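The 2020 file labels treatments as LowNitrogen and SufficientNitrogen, whereas the general legend uses HN and LN. A small helper like the following (a sketch with a hypothetical function name) can bring both spellings onto the HN/LN codes before files are combined.

```python
# Mapping from the labels documented in this README to the HN/LN codes.
TREATMENT_CODES = {
    "lownitrogen": "LN",
    "sufficientnitrogen": "HN",
    "ln": "LN",
    "hn": "HN",
}


def normalize_treatment(label):
    """Return 'HN' or 'LN' for a documented treatment label."""
    key = label.strip().lower()
    if key not in TREATMENT_CODES:
        # Fail loudly rather than guess at an undocumented label.
        raise ValueError(f"Unrecognized treatment label: {label!r}")
    return TREATMENT_CODES[key]
```

Raising an error on unknown labels is a deliberate choice here: silently passing them through could split one treatment into two groups in a later analysis.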
2.SDP_SAP_f2021_field_phenotype.zip
This file contains plot-level phenotype data collected from the 2021 field experiment in Lincoln, Nebraska, including Sorghum Association Panel and Sorghum Diversity Panel entries under contrasting nitrogen treatments.
Variables:
- Plot: unique identifier for each field plot
- Treatment: nitrogen treatment; HN = high nitrogen, LN = low nitrogen
- Rep: replicate number
- Row: field row position of the plot
- Column: field column position of the plot
- GenoIDFromMiaoEtAl: genotype identifier from the referenced prior dataset
- PINumber: PI identifier for the sorghum line
- SAPID: Sorghum Association Panel identifier
- SorghumConversionID: sorghum conversion line identifier
- Name: genotype or line name
- Plant Height: plant height
- DaysToFlower: number of days from planting to flowering
- TillersPerPlant: number of tillers per plant
- StemDiameterLower: lower stem diameter
- StemDiameterUpper: upper stem diameter
- PaniclesPerPlant: number of panicles per plant
- LeafNumber: number of leaves
- LeafLength: leaf length
- LeafWidth: leaf width
- SeedMassPerPlant: seed mass per plant
- SeedProtein: seed protein content
- SeedOil: seed oil content
- SeedAsh: seed ash content
- SeedStarch: seed starch content
- SeedMoisture: seed moisture content
- SeedColor: seed color category
3.SAP_2022_field_phenotype_Alabama_Location.zip
This file contains flowering-time phenotype data collected from the Sorghum Association Panel at the Alabama location in 2022 under contrasting nitrogen treatments.
Variables:
- Plot: unique identifier for each field plot
- Genotype: sorghum accession or line identifier
- Treatment: nitrogen treatment; HN = high nitrogen, LN = low nitrogen
- Block: field block number
- Row: field row position of the plot
- Column: field column position of the plot
- HeadingDate: heading date measured as days to heading/flowering
4A.SAP_2023_field_phenotype.csv.gz
This file contains 2023 field phenotype data for the Sorghum Association Panel.
Variables:
- UID: unique identifier for each plot or observation
- geno: sorghum accession or genotype identifier
- panicle_num: number of panicles
- panicle_weight: total panicle weight
- panicle_avg: average panicle weight
- rachis_len: rachis length
- rachis_dia: rachis diameter
- DaysToBloom: number of days from planting to flowering/bloom
- PlantHeight: plant height
- Treatment_Type: nitrogen treatment; HN = high nitrogen, LN = low nitrogen
4B.SAP_2023_GH_phenotype.csv.gz
This file contains 2023 greenhouse phenotype summary values for Sorghum Association Panel genotypes under high- and low-nitrogen conditions and their nitrogen response.
Variables:
- Taxa: sorghum accession or genotype identifier
- SDW.HN: seedling dry weight under high nitrogen
- SDW.LN: seedling dry weight under low nitrogen
- SDW.NR: nitrogen response for seedling dry weight
- SFW.HN: seedling fresh weight under high nitrogen
- SFW.LN: seedling fresh weight under low nitrogen
- SFW.NR: nitrogen response for seedling fresh weight
- LC.HN: leaf count under high nitrogen
- LC.LN: leaf count under low nitrogen
- LC.NR: nitrogen response for leaf count
- PH.HN: plant height under high nitrogen
- PH.LN: plant height under low nitrogen
- PH.NR: nitrogen response for plant height
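The greenhouse columns follow a TRAIT.CONDITION naming pattern (for example SDW.HN), which makes it straightforward to reshape the table from one row per genotype into one row per genotype, trait, and condition. The sketch below assumes rows have already been read as dictionaries (for instance with `csv.DictReader`); the function name and the output keys `trait`, `condition`, and `value` are hypothetical.

```python
def wide_to_long(rows, id_column="Taxa"):
    """Reshape TRAIT.CONDITION columns into long-format records."""
    long_rows = []
    for row in rows:
        for column, cell in row.items():
            if column == id_column or "." not in column:
                continue
            # "SDW.HN" -> trait "SDW", condition "HN"
            trait, condition = column.rsplit(".", 1)
            # Blank cells are missing values, not zeros.
            value = float(cell) if cell not in ("", None) else None
            long_rows.append(
                {
                    id_column: row[id_column],
                    "trait": trait,
                    "condition": condition,
                    "value": value,
                }
            )
    return long_rows
```

The NR columns are carried through as a third condition alongside HN and LN; the sketch does not recompute them, because the README does not state the formula used to derive nitrogen response.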
5.SAP_2024_field_Phenotype.csv.gz
This file contains 2024 field phenotype data for the Sorghum Association Panel.
Variables:
- plot_id: unique identifier for each plot
- geno: sorghum accession or genotype identifier
- block: field block number
- column: field column position of the plot
- range: field range position of the plot
- experiment: nitrogen treatment; HN = high nitrogen, LN = low nitrogen
- Chl 1: first chlorophyll measurement
- Chl 2: second chlorophyll measurement
- Chl 3: third chlorophyll measurement
- Node Number: number of nodes
- Plant Height 1: first plant height measurement
- Plant Height 2: second plant height measurement
- Plant Height 3: third plant height measurement
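The 2024 file stores three numbered measurements per plot for chlorophyll and plant height. One common way to obtain a single plot-level value is to average the non-missing measurements, sketched below. Averaging is an assumption: the README does not say whether the three values come from different plants or from different dates, so check that before collapsing them. The function name is hypothetical.

```python
def mean_of_repeats(row, prefix, n_repeats=3):
    """Average the numbered columns '<prefix> 1' .. '<prefix> n'.

    Blank or absent cells are skipped; returns None when every
    measurement is missing.
    """
    values = []
    for i in range(1, n_repeats + 1):
        cell = (row.get(f"{prefix} {i}") or "").strip()
        if cell:
            values.append(float(cell))
    if not values:
        return None
    return sum(values) / len(values)
```

For a row read with `csv.DictReader`, `mean_of_repeats(row, "Chl")` and `mean_of_repeats(row, "Plant Height")` would give the two plot-level summaries.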
Units of measure
- DaysToBloom, DaysToFlower, and HeadingDate are recorded in days.
- PlantHeight, Plant Height, Plant Height 1, Plant Height 2, Plant Height 3, FlagLeafLength, ThirdLeafLength, LeafLength, and rachis_len / RachisLength are recorded in centimeters (cm).
- FlagLeafWidth, ThirdLeafWidth, LeafWidth, StemDiameterLower, StemDiameterUpper, and rachis_dia / RachisDiameterLower / RachisDiameterUpper are recorded in millimeters (mm).
- PanicleGrainWeight, EstimatedPlotYield, SeedMassPerPlant, panicle_weight, panicle_avg, SDW, and SFW are recorded in grams (g).
- MoisturePCT, ProteinPCT, OilPCT, AshPCT, StarchPCT, SeedProtein, SeedOil, SeedAsh, SeedStarch, and SeedMoisture are recorded as percentages (%).
- PaniclesPerPlot, PaniclesPerPlant, panicle_num, TillersPerPlant, LeafNumber, ExtantLeafNumber, PrimaryBranchNo, and Node Number are recorded as counts.
- Chl 1, Chl 2, and Chl 3 are chlorophyll meter readings (SPAD units).
- KernelColor and SeedColor are categorical variables.
Missing values
Empty cells indicate missing or unavailable data.
Blank cells should not be interpreted as zero unless a value of 0 is explicitly recorded in the dataset.
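When loading the tabular files, it helps to convert blank cells to an explicit missing marker so they are never confused with a recorded 0. The sketch below does this with the Python standard library, assuming the .csv.gz files are comma-separated with a header row; the function names are hypothetical.

```python
import csv
import gzip


def read_rows(handle):
    """Read CSV rows as dictionaries, turning blank cells into None."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {key: (cell if cell not in ("", None) else None) for key, cell in row.items()}
        )
    return rows


def to_float(cell):
    """Convert a cell to float, preserving None for missing values."""
    return None if cell is None else float(cell)


def load_csv_gz(path):
    """Open a gzip-compressed CSV file and read it with read_rows."""
    with gzip.open(path, "rt", newline="") as handle:
        return read_rows(handle)
```

With this approach a recorded "0" becomes 0.0 after `to_float`, while an empty cell stays None and can be excluded from means or counts.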
Citations:
- Boatwright L, Sapkota S, Jin H, Schnable JC, Brenton Z, Boyles R, Kresovich S (2022) Sorghum association panel whole-genome sequencing establishes pivotal resource for dissecting genomic diversity. The Plant Journal doi: 10.1111/tpj.15853 bioRxiv doi: 10.1101/2021.12.22.473950
- Grzybowski M, Zweiner M, Jin H, Wijewardane NK, Atefi A, Naldrett MJ, Alvarez S, Ge Y, Schnable JC (2022) Variation in morpho-physiological and metabolic responses to low nitrogen stress across the sorghum association panel. BMC Plant Biology doi: 10.1186/s12870-022-03823-2 bioRxiv doi: 10.1101/2022.06.08.495271
- Mangal H, Linders K, Turkus J, Shrestha N, Long B, Kuang X, Cebert E, Torres-Rodriguez JV, Schnable JC Genes and pathways determining flowering time variation in temperate adapted sorghum. bioRxiv doi: 10.1101/2024.12.12.628249
- Shrestha N, Mangal H, Torres-Rodriguez JV, Tross MC, Lopez-Corona L, Linders K, Sun G, Mural RV, Schnable JC (2025) Off-the-shelf image analysis models outperform human visual assessment in identifying genes controlling seed color variation in sorghum. The Plant Phenome Journal doi: 10.1002/ppj2.70013 bioRxiv doi: 10.1101/2024.07.22.604683
- Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890
- Li, H. (2013). Aligning sequence reads, clone sequences, and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3
- McCormick, R. F., Truong, S. K., Sreedasyam, A., Jenkins, J., Shu, S., Sims, D., ... & Mullet, J. E. (2018). The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant Journal, 93(2), 338-354.
- Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., ... & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21.
- Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525-527.
- Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.
- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., ... & Li, H. (2021). Twelve years of SAMtools and BCFtools. Gigascience, 10(2), giab008.
