Data cleaning for bee phenology analysis
Michael Stemkovski
m.stemkovski@gmail.com

### SUMMARY ###

The raw bee occurrence data are processed to generate time-series, which are then passed to the phenophase estimator (see readme_phenophase.txt). Processing is done in "data_cleaning.R" using "bees_2019-05-08.csv" and "samples_2019_10_15.csv".

### DATA DETAILS ###

"bees_2019-05-08.csv" contains raw bee occurrence data corresponding to the insect collection. Sampling methods are described in the paper. There are many columns, many of which are used for housekeeping purposes. Below are descriptions of the ones relevant to the analysis. Some specimens could not be identified, were discarded because they were flies or wasps, or were lost, so some records are blank.

Year: Calendar year in which the specimen was collected.
Site: Location at which the specimen was collected.
Method: Method by which the specimen was collected. "Net" is netting by hand, usually from flowers, and "Bowl" is pan-trapping.
Details: Either the floral species association or the color of the pan trap in which the specimen was caught. This isn't used in the present analysis, but may be useful for future studies.
Date_sampled: Day and month on which the specimen was collected.
Genus_species: Species name.
Sex: Sex of the specimen.

"samples_2019_10_15.csv" records the sampling effort for each collection day. This is used to generate the zeros in the time-series and to calculate abundance based on sampling effort. The relevant columns are as follows. The columns relevant for netting were not used in the present study, but may be useful for subsequent work.

Year: Calendar year in which sampling was performed.
Date_sampled: Day and month on which sampling was performed.
Site: Location of sampling.
Block: Sampling was usually performed in groups of multiple sites, so sites within a block usually have the same dates. Not used in the analysis.
bowls_down: Time at which pan traps were placed.
bowls_up: Time at which specimens were collected from traps.
bowl_time: Elapsed time that pan traps were active. This is the sampling effort used in the analysis. The times vary due to logistics and weather.
start_am: Time when netting was started in the morning.
am_nettime_hrs: Duration of netting in the morning. This is the total time summed across all netters (if there are three netters, each would net for 20 minutes).
start_pm: Time when netting was started in the afternoon.
pm_nettime_hrs: Duration of netting in the afternoon.
total_time: Sum of am and pm netting time.

### CODE DETAILS ###

"data_cleaning.R" generates time-series as the number of individuals caught in each sampling period divided by the sampling time on that day.

By default, singletons are excluded. These are cases in which there was only one individual of a species in a given year/site. They are either anomalous catches or mistakes in data entry. The exclusion can be turned off, but I recommend against it.

The "codes" section can be ignored. It is a relic from when the output was being passed to Matlab for phenophase estimation early in the project.

See the comments in the script for further details.
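
### EXAMPLE SKETCH ###

To illustrate the calculation described above, here is a minimal sketch in base R. It is not the code in "data_cleaning.R", which remains the reference. The function name "make_time_series", the "exclude_singletons" argument, the toy species names, the date strings, and all values are hypothetical. The sketch also assumes that only "Bowl" records are used (because bowl_time is the effort measure) and that records with a blank Genus_species are dropped.

```r
# Toy stand-ins for the two input files. All values are invented.
bees <- data.frame(
  Year = rep(2015, 6),
  Site = rep("A", 6),
  Method = c("Bowl", "Bowl", "Bowl", "Bowl", "Net", "Bowl"),
  Date_sampled = c("01-Jun", "01-Jun", "15-Jun", "15-Jun", "01-Jun", "29-Jun"),
  Genus_species = c("Species_A", "Species_A", "Species_A",
                    "Species_B", "Species_B", ""),
  stringsAsFactors = FALSE
)

samples <- data.frame(
  Year = rep(2015, 3),
  Site = rep("A", 3),
  Date_sampled = c("01-Jun", "15-Jun", "29-Jun"),
  bowl_time = c(6, 5, 6),  # units assumed to be hours
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per species/year/site/sampling day.
make_time_series <- function(bees, samples, exclude_singletons = TRUE) {
  # Assumption: keep pan-trap records with a species name; which() drops NAs.
  keep <- which(bees$Method == "Bowl" &
                  !is.na(bees$Genus_species) & bees$Genus_species != "")
  bowl <- bees[keep, ]

  # Count individuals per species, year, site, and sampling day.
  bowl$n <- 1
  counts <- aggregate(n ~ Genus_species + Year + Site + Date_sampled,
                      data = bowl, FUN = sum)

  # Yearly totals per species/site, used to find singletons.
  totals <- aggregate(n ~ Genus_species + Year + Site, data = counts, FUN = sum)
  if (exclude_singletons) {
    totals <- totals[totals$n > 1, ]
  }

  # Pair each retained species/year/site with every sampling day at that
  # site and year, so that days without a catch can become zeros.
  grid <- merge(totals[, c("Genus_species", "Year", "Site")],
                samples[, c("Year", "Site", "Date_sampled", "bowl_time")],
                by = c("Year", "Site"))

  # Attach the counts; sampling days with no catch get n = 0.
  ts <- merge(grid, counts,
              by = c("Genus_species", "Year", "Site", "Date_sampled"),
              all.x = TRUE)
  ts$n[is.na(ts$n)] <- 0

  # Abundance = individuals caught divided by sampling effort on that day.
  ts$abundance <- ts$n / ts$bowl_time

  # Note: Date_sampled is sorted as text here, not chronologically.
  ts[order(ts$Genus_species, ts$Year, ts$Site, ts$Date_sampled), ]
}

ts <- make_time_series(bees, samples)
print(ts)
```

With this toy input, Species_B is dropped as a singleton (one pan-trap individual in the year/site), and Species_A gets three rows with abundances 2/6, 1/5, and 0. Calling make_time_series(bees, samples, exclude_singletons = FALSE) keeps Species_B as well.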