Background:

The numbers of days that people consume alcohol and other drugs over a fixed time interval, such as 28 days, are often collected in surveys for research in the addictions field.
The presence of an upper bound on these variables can result in response distributions with "ceiling effects".
Also, if some people’s substance use behaviors are characterized by various weekly patterns of use, summaries of substance days-of-use over longer periods can exhibit multiple modes. Multiple modes can also result from "heaping" of responses when respondents are unsure about the precise value.
These characteristics of substance days-of-use data mean that models assuming common parametric response distributions will not always provide a good fit.

Repository contents:

Simulated longitudinal cannabis days-of-use data over 28-day intervals, intended to reproduce characteristics of data reported by respondents to a survey of illicit drug users run over 4 waves during the COVID-19 pandemic in Australia in 2020–21. The dataset includes a generated subject_id and the explanatory variables survey_wave and iso, where iso is a dummy variable indicating subjects who were in quarantine or isolation at the time of the 28-day interval.

R code to fit proportional-odds and continuation-ratio ordinal models, as well as binomial, beta-binomial, negative binomial and hurdle negative binomial models, to these data is available at a linked companion website.

We fitted a Bayesian multinomial model to reported cannabis days-of-use over four 28-day intervals (four survey waves) during the COVID-19 pandemic in Australia. Cannabis days-of-use was modeled as a nominal categorical variable with 29 levels, one for each possible response (0 days, 1 day, ..., 28 days).
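
As a small illustration of what treating days-of-use as a nominal variable with 29 levels looks like in R, the sketch below converts counts to a factor. The values in days are made up for illustration and are not taken from the dataset; the object names are hypothetical.

```r
# Made-up days-of-use values for illustration only (not from the dataset)
days <- c(0, 7, 14, 28, 28, 3)

# One level per possible response: 0 days, 1 day, ..., 28 days
days_cat <- factor(days, levels = 0:28)

nlevels(days_cat)   # number of response categories
table(days_cat)     # counts per category, including levels never observed
```

Fixing the levels to 0:28 keeps categories with no observations in the table, which matters when the response distribution is heaped or multimodal.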

The model, fitted to responses from 443 illicit drug users across four survey waves, included only survey wave and isolation status (in isolation or quarantine, yes/no) as explanatory variables, with a random intercept for subject_id.

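A simulated sample of 600 participants was generated by twice subsampling 300 subject_ids without replacement from the full set of 443. Most participants will have been selected in both subsamples: at least 157 of the 300 in each subsample (300 + 300 − 443), and about 203 on average (300 × 300 / 443).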

A single cannabis days-of-use value was simulated for each of 2 subsamples × 300 subject_ids × 4 survey waves = 2400 28-day intervals. The simulated cannabis days-of-use responses were generated from a single draw from the posterior predictive distribution for each subsample.
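
The subsampling scheme and the resulting number of 28-day intervals can be sketched in R as follows. This is an illustration of the design only, not the authors' code: the seed, the object names and the source_id and subsample columns are assumptions, and the step that draws responses from the posterior predictive distribution is not reproduced.

```r
# Sketch of the subsampling design described above (not the authors' code)
set.seed(2021)        # arbitrary seed, assumed here for reproducibility

n_subjects  <- 443    # subjects in the fitted model
n_subsample <- 300    # subject_ids drawn per subsample
n_waves     <- 4      # survey waves

# Two independent draws of 300 subject_ids, each without replacement
subsample_1 <- sample(seq_len(n_subjects), n_subsample, replace = FALSE)
subsample_2 <- sample(seq_len(n_subjects), n_subsample, replace = FALSE)

# Each selected subject contributes one 28-day interval per survey wave
design <- rbind(
  expand.grid(survey_wave = seq_len(n_waves),
              source_id   = subsample_1,
              subsample   = 1L),
  expand.grid(survey_wave = seq_len(n_waves),
              source_id   = subsample_2,
              subsample   = 2L)
)

nrow(design)                                   # 2 x 300 x 4 intervals
length(intersect(subsample_1, subsample_2))    # subjects drawn in both subsamples
```

Because each draw is without replacement, no subject is repeated within a subsample; repetition arises only across the two subsamples.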

The survey wave and isolation explanatory variables and subject_id are included in the supplied dataset. Survey participants are not identifiable.

The data are provided in an R dataset, synthetic_cannabis_use.RData.
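
A minimal sketch of loading the file in R is shown below. The names of the objects stored inside the .RData file are not stated here, so the sketch loads them into a separate environment and lists whatever it finds; the helper read_rdata is hypothetical and is not part of the repository.

```r
# Hypothetical helper: load an .RData file and return its objects as a named list
read_rdata <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)   # load() returns the names it restored
  mget(object_names, envir = env)
}

data_file <- "synthetic_cannabis_use.RData"

# Only attempt to read the file if it is in the working directory
if (file.exists(data_file)) {
  contents <- read_rdata(data_file)
  str(contents, max.level = 1)              # overview of the stored objects
}
```

Loading into a separate environment avoids overwriting objects that already exist in the workspace.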

To run the R code accompanying the dataset, the RStan package (https://mc-stan.org/users/interfaces/rstan) must also be installed.
