Skip to main content

Common field data limitations can substantially bias sexual selection metrics

Cite this dataset

Cramer, Emily; Kaiser, Sara Ann; Webster, Mike S; Ryder, T Brandt (2020). Common field data limitations can substantially bias sexual selection metrics [Dataset]. Dryad.


Sexual selection studies widely estimate several metrics, but values may be inaccurate because standard field methods for studying wild populations produce limited data (e.g., incomplete sampling, inability to observe copulations directly). We compared four selection metrics (Bateman gradient, opportunity for sexual selection, opportunity for selection, and s’max) estimated with simulated complete and simulated limited data for 15 socially monogamous songbird species with extra-pair paternity (4-54% extra-pair offspring). Inferring copulation success from offspring parentage creates non-independence between these variables and systematically underestimates copulation success. We found that this introduces substantial bias for the Bateman gradient, opportunity for sexual selection, and s’max. Notably, 47.5% of detected Bateman gradients were significantly positive for females, suggesting selection on females to copulate with multiple males, though the true Bateman gradient was zero. Bias generally increased with the extent of other sources of data limitations tested (nest predation, male infertility, and unsampled floater males). Incomplete offspring sampling introduced bias for all metrics except the Bateman gradient, while incomplete sampling of extra-pair sires did not introduce additional bias when sires were a random subset of breeding males. Overall, our findings demonstrate how biases due to field data limitations can strongly impact the study of sexual selection.


Empirical data were gathered from the published literature and by collecting additional unpublished information from published datasets. 

Usage notes

This data supplement contains 7 items, with dual goals of providing all the code needed to replicate the results in the manuscript and providing code to researchers who want to explore bias in their own study systems. Note that results for individual study systems should be viewed with caution, as assumptions about the exact mechanisms of extra-pair copulations and fertilizations may not be appropriate for a particular study system.

The utility for other researchers to explore their own study systems was a secondary goal. Thus the code is not streamlined for this application. Suggestions for how to approach this are at the end of this document.

We provide:
1. Simulation code used to generate populations from multiple different species, with flexibility in the number of data limitations and their degree (MethodsMSSimCode_20203027.R). Within the simulation, each population's Bateman gradients are calculated and saved, as are data needed to calculate the other metrics. Other information about the population is not stored.
The simulation code outputs four dataframes that are intended to be written as .csv files, to be re-opened and analysed separately. "GrandvarianceTerms" includes some descriptive information about the study population (number of extra-pair offspring), as well as true and detected variance in reproductive and copulation success. It is one row per simulation iteration.
"GrandGradientSummary" contains the model output for eight Bateman gradients per simulation iteration (true and detected gradients for each of two sexes, with both a raw and a relativized form). The different types of Bateman gradient are identified in a column named "Model" with levels: FemaleBatemanApparent, MaleBatemanApparent, FemaleBatemanTrue, MaleBatemanTrue, MaleBatemanDetRel_Terr, MaleBatemanTrue_Both, FemaleBatemanApparentRel, and FemaleBatemanTrueRel. The latter four are the relativized forms of the gradient.
"GrandPopSUmmaryTot"  summarizes, for each replicate population, the number of extra-pair males that sired offspring in each brood. Length of the output will vary.
"GrandPopSummaryNEPTot" summarizes, for each replicate population, the number of extra-pair offspring for each brood size. Length of the output will vary.

2. Though they are not used in the main study, we include two additions relevant for researchers wanting to model their own systems. First, you can add random noise to your estimates of m and s by un-commenting the code at lines 90-91 and 153-154 (consider commenting out lines 102 and 103). This incorporates up to 10% variation in m and s (and can easily be adjusted to add a different amount of noise). 
Second, you can incorporate spatial limitations. Because this was a longer piece of code, we have supplied it as a separate file (ModificationsToIncludeSpatialImpacts.R), which can be used to replace the code at lines 149-192 in the main simulation code.

3. Code used to analyze the output from the simulations (ExaminingSims_Methods_20200327_commented.R). The code for statistical analysis and visualization starts by re-shaping the data and combining content from the GrandVarianceTerms and GrandGradientSummary dataframes in order to make a dataframe for s'max (which is then written as a file and re-read from the file for convenience). The processing includes a number of steps that may not be useful for examining one's own study population, but that were needed for our analyses. We have erred on the side of including more than necessary, and noting places where we expect most readers would prefer to comment out the code (below in specific instructions for simulating individual populations).
For the analyses presented in the paper, there are three main dataframes for statistics (ie., that contain the mean values across replicate iterations of the simulation) and three main dataframes with information on the individual iterations, which were used in preparing figures. Dataframes with the word Variance in the name are optimized for opportunity for selection. Dataframes with Gradients in the name are for the Bateman gradient. Dataframes with JonesIndex in the name are for s'max, the relativized Bateman gradient, and opportunity for sexual selection.
For each separate analytical question, the main dataframe is filtered to contain only one source of data limitation at a time.
Starting at line 390, this code begins the analysis presented in the main paper. Analysis follows the overall structure in the paper: first looking at basic simulation conditions, then at infertility, then predation, etc. Each of these major subdivisions has a header in the code, to facilitate quick access.
Within each major heading, analysis is first from the variance dataframe (i.e., opportunity for selection), then from the gradients dataframe (Bateman gradients), then from the JonesIndex dataframe (first s'max then opportunity for sexual selection).
Within each response variable, first the dataframe is filtered for graphing and analysis. Then there is the main statistical analysis, then calculation of Cohen's d, then graphing. Note that the structure of the data was different in the variance data frame (where a subset = statement is needed for the analysis to include only opportunity for selection) compared to the others (where each column represents a separate response variable and no subsetting is required)
The overall approach was to save output from all the statistical tests into dataframes (one for ANOVA and one for parameter estimats), consolidate them, and construct output tables (starting line 2321). 
Finally, example code for visualizing how extra-pair offspring and extra-pair sires are distributed across broods begins at line 2371. This section is not exhaustive, as it is only supplementary to the main paper.

4 and 5. Our raw data input files, named "BrommerEtAlRawDataFile.csv" and "BrommerSpeciesOverViews.csv". The former contains three columns with no headers, and one row per nest. The first column gives a population identifying number. The second is the number of extra-pair offspring in the nest. The third is the brood size.
The latter file contains a summary, one row per population (including all the populations from Brommer et al. 2010, of which we used only 4). It has column headers. The columns are: Population (a population identifying number, to match with the BrommerSpeciesOverViews.csv file), species (an abbreviated form of the common name), m (described in the paper; mean of the Poisson distribution for number of extra-pair copulations), s (described in the paper; the sperm competition paramter), N (the number of nests used in estimating m and s), PctEPY (the empirical estimate of the percent of offspring sired by extra-pair males), PctEPNests (the percent of nests containing at least one extra-pair offspring), and ApproxAveClutchSize (the average brood size, estimated from the same dataset used in estimating m and s), and Latin (the scientific name of the species, with genus and specific epithet joined by an underscore; note that this uses older nomenclature for compatibility with For most users, only Population, species, m, and s are necessary; other columns were used in down-stream analysis and are not needed in the simulation.

6 and 7. For researchers wanting to evaluate their own study system, we include our modification of Brommer et al.'s code for estimating m and s. This consists of two files; one is R code (FindingMandS.R) and the other is a .txt file specifying their Bayesian model (BrommerEtAlModel_Mod.txt.; simplfied to handle one species at a time). These files should be in the same directory when you run the code.
The R code works by calling the Bayesian model. The input file is two columns with headers "epy.obs" and "brood.size", and one row per nest. For each nest, the number of extra pair offspring and the brood size is given.

Advice for researchers wanting to examine their own study system: 
First estimate m and s. (FindingMandS.R)

Next, for easiest use in our code, slightly modify your input file for estimating m and s to create one of the required input files (equivalent to BrommerEtAlRawDataFile.csv). It should be one row per brood, with no column headers. Column 1 is the species or population identification number. Column 2 is the number of extra-pair offspring. Column 3 is the brood size. These are named early in the R code.
The second data file (BrommerSpeciesOverViews.csv) must include the same species or population identification number, m, and s. Follow the column names given above, or use our file as a template. 

Next, update the simulation code (MethodsMSSimCode_20203027.R) to your desired settings. 

You may also update your input file names (L 9 and 12)

Update varibles:
NSims (L72) for the number of replicates per simulation condition.
PopsOfInterest (L73), a vector of the individual species to be considered (here labelled as populations; terminology inherited from Brommer et al. 2010, which examined several separate populations per species; species and population are redundant in this code).
TotalInfertilityList (L79): a COUNT of the number of males infertile with fertile sperm precedence, assuming a population of 400 males
PartialInfertilityList (L80): a COUNT of the number of males infertile without fertile sperm precedence, assuming a population of 400 males
PredationCountList (L81): a COUNT of the number of nests depredated (whole-brood depredation), assuming 400 nests initially 
UnbandedMaleList (L82): a COUNT of the number of territorial males not sampled, assuming 400 males in the population
UnfoundNestList (L83): a COUNT of the number of nests not sampled, assuming 400 nests in the population 
FloaterMalesList (L84): a COUNT of the number of floater males to be added to the population of 400 territorial males 
PartialBroodLossList (L85): a PROPORTION of offspring to be depredated (randomly with respect to nest of origin; partial-brood depredation)
Note that in the described simulations, only one of these variables was non-zero per simulation run; we have not fully automated this code. 

If you would like, you can add random noise around your estimates of m and s by un-commenting-out lines 90-91 and 153-154 (consider commenting out lines 102 and 103, which then are redundant).
If you would like to incorporate spatial constraints, replace the code at lines 149-192 in the main simulation with the code in file "ModificationsToIncludeSpatialImpacts.R". Note that this is NOT compatible with including floater males in the simulations. 

Run your simulations (by running lines 88 - 590). Save the files as you prefer (L 593-596).

Next, examine the results by using "ExaminingSims_Methods_20200327_commented.R". For purely descriptive purposes, the applicable portions of this code are up to line 353, 671-384, and after line 2371.
If your goal is to compare true and detected values under different simulation conditions, we recommend that you run the first 353 lines. You will need to update the read-in file names at lines 19, 32, 246, and 337. You may also want to re-name the file to be written containing s'max calculations at line 335.
If you use the same simulation results multiple times, you may comment out lines 117-125, 276-283, and 332-335, which are needed for generating the dataset for s'max.
Our process required merging datasets and re-naming a column, which may not be necessary for your dataset: please check and comment out if needed lines 111-113, 185-187, 291-293. Failure to do this may mess up your variable names.

To examine true and detected values estimated from a study population, the best data frames are VarianceTermsSummarySupp (which gives mean, standard deviation, and 95% confidence limits for opportunity for selection as outputs); GradientsSummary (providing similar information for the Bateman gradient); and JonesIndexSumm(providing information for s'max--called here the JonesIndex for coding ease--and opportunity for sexual selection) 

To examine histograms of extra-pair offspring and sires, modify and run code starting at line 2372.