# Data for occupancy–detection models with museum specimen data: promise and pitfalls

---

The dataset contains simulated biodiversity occurrence data used to test the efficacy of occupancy-detection models with natural history specimen occurrence information. Also included are data from the Global Biodiversity Information Facility (DOI cited in the paper) for North American dragon- and damselflies used in the case study analysis.

## Description of the data and file structure

This .zip file contains the code/data environment used for the manuscript entitled "Utilizing occupancy-detection models with museum specimen data: promise and pitfalls", authored by Vaughn Shirey, Rassim Khelifa, Leithen M'Gonigle, and Laura Melissa Guzman. For convenience, compiled model results from each workflow/simulation are presented in the .rds files. The .rds files can be considered the primary data of the study. All of the code (including that used for simulation) and results are contained within the .zip file.

The following outlines the contents of each .rds file:

* sim_range_all_censored.rds: simulated data and model results for hypothetical species where ranges are simulated and the model is censored only to regions where the species is found to occur.
* species_orig_occupancy.rds: the original occupancy status of all species in the study.
* sim_range_all_uncensored.rds: simulated data and model results for hypothetical species where ranges are simulated but the model is not censored to only regions where the species is found to occur.
* sim_all_out.rds: simulated data and model results for hypothetical species that can occur in all sites and are modeled in all sites.
* sim_range_supp_out.rds: simulated data and model results for hypothetical species where ranges are simulated for a supplementary analysis.

Each .rds file contains a dataframe with the same columns:

* term: the parameter in the model being referenced by the other value columns.
* estimate: the model-estimated value of the term.
* lower: the lower bound of the 95% Bayesian credible interval for the value of the term.
* upper: the upper bound of the 95% Bayesian credible interval for the value of the term.
* svalue: the S-value of the term estimate.
* rhat: the Gelman-Rubin convergence statistic for that parameter estimate.
* true_val: the known, true value of the term.
* visit_sim: the type of visit simulation (should always be "visits").
* visit_mod: the type of imputation of non-detection data. Can be "all", meaning all sites had inferred non-detections; "detected", meaning only sites where one other species was detected; or "visits", meaning sites where the true visit history was known.
* eras: the number of occupancy intervals in the model (2, 5, or 10).
* r: the index of the simulation (1-10).
* nyr: the number of years simulated (always 10).
* prop.visits.same: the probability that a given visit in a simulation was community focused (0, 0.25, 0.5, 0.75, or 1).
* mu.v.yr: the average trend in detection probability over simulated years.
* s: the index of the model run.

We strongly recommend using a computing cluster to reproduce this analysis. We used the ComputeCanada infrastructure, where we selected which simulation to run models on by passing array parameters to a Slurm scheduler.

## Sharing/access Information

The dragon- and damselfly dataset can be found via the following DOI: https://doi.org/10.15468/dl.cabqrc
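
## Example: relating the result columns

The column descriptions above suggest a simple check a reader might run on the compiled results: whether the 95% credible interval (`lower`, `upper`) contains the known value (`true_val`) for parameters whose `rhat` indicates convergence. The following is a minimal sketch of that idea, not the authors' analysis code. It assumes the rows have already been exported from R (where each file can be read with `readRDS()`) into plain Python dictionaries. The term names, the numeric values, and the `rhat` cutoff of 1.1 are all illustrative assumptions and do not come from the dataset.

```python
# Hypothetical rows that mimic the documented columns. The term names and
# all numbers are invented for illustration; they are NOT from the .rds files.
rows = [
    {"term": "param_a", "estimate": 0.42, "lower": 0.10, "upper": 0.75,
     "rhat": 1.01, "true_val": 0.50},
    {"term": "param_b", "estimate": -0.18, "lower": -0.30, "upper": -0.05,
     "rhat": 1.00, "true_val": 0.10},
    {"term": "param_c", "estimate": 1.20, "lower": 0.40, "upper": 2.10,
     "rhat": 1.35, "true_val": 1.00},
    {"term": "param_d", "estimate": 0.55, "lower": 0.20, "upper": 0.90,
     "rhat": 1.02, "true_val": 0.60},
]

# Assumed cutoff: 1.1 is a commonly used rule of thumb for the Gelman-Rubin
# statistic, not a value stated in this README.
RHAT_THRESHOLD = 1.1


def is_converged(row, threshold=RHAT_THRESHOLD):
    """Return True if the row's rhat is below the assumed cutoff."""
    return row["rhat"] < threshold


def interval_covers_truth(row):
    """Return True if true_val lies inside [lower, upper]."""
    return row["lower"] <= row["true_val"] <= row["upper"]


def coverage(rows):
    """Fraction of converged rows whose interval contains true_val.

    Returns None when no row passes the convergence filter.
    """
    kept = [r for r in rows if is_converged(r)]
    if not kept:
        return None
    return sum(interval_covers_truth(r) for r in kept) / len(kept)


if __name__ == "__main__":
    print(f"Coverage among converged terms: {coverage(rows):.2f}")
```

In practice the same filter could be applied separately for each combination of `visit_mod`, `eras`, and `prop.visits.same` to compare simulation scenarios.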