Evaluating the quality of ecoinformatics data derived from commercial agriculture: a repeatability analysis of pest density estimates


Each year, consultants and field scouts working in commercial agriculture undertake a massive, decentralized data collection effort as they monitor insect populations to make real-time pest management decisions. These data, if integrated into a database, offer rich opportunities for applying big data or ecoinformatics methods in agricultural entomology research. However, questions have been raised about whether the underlying quality of these data is high enough to be a foundation for robust research. Here I suggest that repeatability analysis can be used to quantify the quality of data collected from commercial field scouting, without requiring any additional data gathering by researchers. In this context, repeatability quantifies the proportion of total variance across all insect density estimates that is explained by differences across populations and is thus a measure of the underlying reliability of observations. Repeatability was moderately high for cotton fields scouted commercially for total Lygus hesperus Knight densities (R = 0.631) and improved further when observer effects were accounted for (R = 0.697). Repeatabilities appeared to be somewhat lower than those computed for a comparable, but much smaller, researcher-generated data set. In general, the much larger sizes of ecoinformatics data sets are likely to more than compensate for modest reductions in measurement precision. Tools for evaluating data quality are important for building confidence in the growing applications of ecoinformatics methods. Here I report the raw data that support these analyses in two files: one reporting the data gathered by the commercial pest control consultant and his summer field scouts, and the second reporting the data gathered by a team of university researchers.
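As a rough illustration of the quantity described above, the following is a minimal sketch that estimates repeatability as among-group variance divided by total variance, using the classic one-way ANOVA estimator for unbalanced groups. It is not necessarily the procedure behind the reported R values: it ignores observer effects and treats counts as approximately Gaussian. The function name `repeatability` and the input layout (one list of sweep counts per field visit) are assumptions made for this example.

```python
import math


def repeatability(groups):
    """ANOVA-based repeatability: var_among / (var_among + var_within).

    `groups` is a list of lists, one inner list of counts per population
    (for example, one field on one date). None and NaN entries are skipped.
    Requires at least two non-empty groups and more observations than groups.
    """
    # Drop missing values, then drop groups left with no observations.
    cleaned = [
        [x for x in g if x is not None and not math.isnan(x)] for g in groups
    ]
    cleaned = [g for g in cleaned if g]

    a = len(cleaned)                      # number of groups
    n_total = sum(len(g) for g in cleaned)
    grand_mean = sum(sum(g) for g in cleaned) / n_total
    means = [sum(g) / len(g) for g in cleaned]

    ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(cleaned, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(cleaned, means) for x in g)

    ms_among = ss_among / (a - 1)
    ms_within = ss_within / (n_total - a)

    # Effective group size for unbalanced designs.
    n0 = (n_total - sum(len(g) ** 2 for g in cleaned) / n_total) / (a - 1)

    var_among = (ms_among - ms_within) / n0
    return var_among / (var_among + ms_within)


# Made-up toy values, not taken from the data files.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
r_toy = repeatability(toy_groups)
```

A mixed-model approach with field and observer as random effects would be the natural extension when observer identity is recorded.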


All the data were gathered using sweep sampling in commercial cotton fields. A single sample is gathered by swinging the insect net 50 times across the top of the plant canopy and then counting the number of nymphal and adult Lygus spp. captured.

Usage Notes

Metadata pages included with each data file give simple explanations for each variable. The number of sweep samples taken per field varies; thus, fields that received fewer than the maximum number of sweep samples contain many NA values.
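Because the number of sweep samples differs among fields, the NA entries need to be dropped before any per-field calculation. The sketch below shows one way to do this with the Python standard library, assuming the data have been exported to CSV with one row per field and one column per sweep sample. The column names (`field`, `s1`, `s2`, `s3`) and the helper `field_samples` are hypothetical; the actual variable names are given in the metadata pages.

```python
import csv
import io


def field_samples(csv_text, id_column, sample_columns, na_token="NA"):
    """Return {field id: [counts]} with NA and empty entries dropped."""
    samples = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts = [
            float(row[col])
            for col in sample_columns
            if row[col].strip() not in ("", na_token)
        ]
        samples[row[id_column]] = counts
    return samples


# Made-up toy rows for illustration only; not taken from the data files.
toy_csv = "field,s1,s2,s3\nA,3,5,NA\nB,0,NA,NA\nC,2,4,6\n"
samples = field_samples(toy_csv, "field", ["s1", "s2", "s3"])
```

Each field then carries only the samples that were actually taken, so unequal sample sizes are handled explicitly rather than through placeholder values.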


U.S. Department of Agriculture, Award: 2015-70006-24164