# Data and code for code review regression analysis

This dataset contains the repository data used for our study "A Large-Scale Study of Modern Code Review and Security in Open Source Projects" ([DOI](https://doi.org/10.1145/3127005.3127014)). The dataset was collected from GitHub and includes 3,126 projects in 143 languages, with 489,038 issues and 382,771 pull requests (after data processing). It includes the full text and metadata of issues and pull requests, as well as metadata about each repository itself. Also included are the main notebook used for the regression analysis reported in our paper (`Regression.ipynb`), as well as an alternate notebook used for testing our analysis on data generated from an alternate quantification model (`Regression-RFCC.ipynb`).

## Contents

- `Regression.ipynb` - The main Jupyter notebook containing our full data processing and regression analysis (in R).
- `Regression-RFCC.ipynb` - The same analysis as our main notebook, but using security issues as estimated by our alternate quantification model.
- `Regression.R` and `Regression-RFCC.R` - Extracted pure R versions of the two Jupyter notebooks.
- `Regression.html` and `Regression-RFCC.html` - Rendered versions of the two Jupyter notebooks, as static HTML for viewing.
- `repos_data_nn.csv` - Our main dataset, containing the full repository set and per-repository security issues estimated using our neural-network quantification model. This is the dataset used for the analysis reported in our paper.
- `repos_data_rfcc.csv` - The same dataset, but with per-repository security issues estimated using an alternate random-forest classify-and-count (RFCC) quantification model.
- `lang-info.csv` - A dataset of the languages in our main dataset, hand-labeled with whether they are programming languages (vs. markup languages, etc.) and whether they are memory-safe ("MAYBE" means they are memory-safe by default, but can be used in an unsafe manner).
- `population.csv` - The entire population of repositories from which we sampled our dataset. Gathered using the [GitHub Archive](https://www.githubarchive.org/).

## How we gathered this data

We drew from the sub-population of GitHub repositories that had at least 10 pushes, 5 issues, and 4 contributors from 2012 to 2014. We used the GitHub Archive, a collection of all public GitHub events, to generate a list of all such repositories. This gave us 48,612 candidate repositories in total. From this candidate set, we randomly sampled 5,000 repositories.

We wrote a scraper that pulls all non-commit data for a GitHub repository (such as descriptions and issue and pull request text and metadata) through the GitHub API, and used it to gather data for each repository in our sample. After scraping, we had 4,937 repositories (due to some churn in GitHub repositories).

We manually labeled each language used by each repository on two independent axes: whether it is a programming language, and whether it is memory-safe.

We used two quantification models (as explained in our paper) to estimate the number of issues in each repository that were security bugs. The results of each are in separate dataset files (`repos_data_nn.csv` and `repos_data_rfcc.csv`).

## Usage notes

Our main analysis (as reported in our paper) is contained in the Jupyter notebook `Regression.ipynb`. To run it, you need an active Jupyter instance with the R kernel. We ran these analyses using R version 3.3.1 and the `ggplot2`, `reshape2`, `plyr`, `car`, `tibble`, and `ggfortify` packages. Full system details are available at the bottom of each notebook.

We also include extracted pure R versions of each notebook, as well as pre-rendered static HTML versions, which can be viewed without installing any software.
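The filter-then-sample procedure described under "How we gathered this data" can be sketched as follows. This is an illustrative sketch only: the repository records below are synthetic, the field names are hypothetical, and the real candidate list came from the GitHub Archive rather than being generated in code.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins for candidate repositories; each record summarizes
# a repository's 2012-2014 activity (hypothetical field names).
candidates = [
    {"name": f"repo-{i}",
     "pushes": random.randint(0, 50),
     "issues": random.randint(0, 20),
     "contributors": random.randint(1, 10)}
    for i in range(20_000)
]

# Keep only repositories meeting the activity thresholds from the study:
# at least 10 pushes, 5 issues, and 4 contributors.
eligible = [r for r in candidates
            if r["pushes"] >= 10 and r["issues"] >= 5 and r["contributors"] >= 4]

# Randomly sample a fixed number of eligible repositories
# (the study sampled 5,000 from 48,612 candidates).
sample_size = min(5_000, len(eligible))
sampled = random.sample(eligible, sample_size)
```

Fixing the random seed before sampling is what makes a selection like this reproducible; the actual sample used in the study is recorded in `population.csv` and the dataset files themselves.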