Modeling cannabinoids from a large-scale sample of Cannabis sativa chemotypes

Cite this dataset

Vergara, Daniela; Gaudino, Reggie; Blank, Thomas; Keegan, Brian (2020). Modeling cannabinoids from a large-scale sample of Cannabis sativa chemotypes [Dataset]. Dryad.


The widespread legalization of Cannabis has opened the industry to contemporary analytical techniques for chemotype analysis. Chemotypic data have been collected on a wide variety of oil profiles inherent to commercially available cultivars. The unknown gene regulation and pharmacokinetics of dozens of cannabinoids offer opportunities of high interest for pharmacology research. Retailers in many medical and recreational jurisdictions are typically required to report chemical concentrations of at least some cannabinoids, and commercial cannabis laboratories have therefore collected large chemotype datasets of diverse Cannabis cultivars. In this work, a dataset of 17,600 cultivars tested by Steep Hill Inc. is examined using machine learning techniques to impute missing chemotype observations and to cluster cultivars into groups based on chemotype similarity. The results indicate that cultivars cluster based on their chemotypes, and that some imputation methods work better than others at grouping these cultivars based on chemotypic identity. Because of missing data and the low signal-to-noise ratio for some less common cannabinoids, the behavior of those cannabinoids could not be accurately predicted. These findings have implications for characterizing complex interactions in cannabinoid biosynthesis and improving phenotypical classification of Cannabis cultivars.
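
The two-step workflow described above (fill in missing chemotype values, then cluster cultivars by chemotype similarity) can be illustrated with a minimal sketch. The abstract does not say which imputation methods or clustering algorithm the study used, so column-mean imputation and k-means appear here only as simple stand-ins. The sample values, the three-column layout (THC, CBD, CBG) and the function names are all hypothetical and are not taken from the dataset.

```python
import math

# Hypothetical toy chemotype table, NOT values from the dataset.
# Columns are assumed to be THC %, CBD %, CBG %; None marks a missing test result.
samples = [
    [20.0, 0.5, None],
    [22.0, None, 0.8],
    [18.5, 0.4, 0.6],
    [1.0, 12.0, None],
    [0.8, 14.0, 0.3],
    [None, 13.0, 0.2],
]


def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


def kmeans(points, k, n_iter=50):
    """Plain k-means. Centroids start at the first k points (a simplification
    that keeps the sketch deterministic; random restarts are more usual)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


imputed = impute_column_means(samples)
labels, centroids = kmeans(imputed, k=2)
print(labels)
```

Mean imputation pulls every filled-in value toward the overall average, which can blur the boundary between chemotype groups. That is one reason the choice of imputation method can change how well cultivars group, as the abstract reports.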


Agricultural Genomics Foundation

Natural Hazards Center, University of Colorado Boulder, Award: gift fund 13401977-Fin8