Code review regression analysis of open source GitHub projects
Data files
Aug 31, 2017 version, files: 10.56 MB
Abstract
Methods
We sampled from the sub-population of GitHub repositories that had at least 10 pushes, 5 issues, and 4 contributors from 2012 to 2014. We used the GitHub Archive, a collection of all public GitHub events, to generate a list of all such repositories. This gave us 48,612 candidate repositories in total. From this candidate set, we randomly sampled 5,000 repositories. We wrote a scraper that pulls all non-commit data for a GitHub repository (such as descriptions, and the text and metadata of issues and pull requests) through the GitHub API, and used it to gather data for each repository in our sample. After scraping, we had 4,937 repositories; the other 63 were lost to churn in GitHub repositories.
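The filter-then-sample step described above can be sketched as follows. This is a minimal illustration, not the scripts used for the dataset: the `RepoActivity` record, the function names, and the fixed random seed are all hypothetical, and the sketch assumes the per-repository counts have already been aggregated from the GitHub Archive events.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoActivity:
    """Hypothetical per-repository event counts for 2012 to 2014."""
    name: str
    pushes: int
    issues: int
    contributors: int


# Thresholds taken from the Methods text.
MIN_PUSHES = 10
MIN_ISSUES = 5
MIN_CONTRIBUTORS = 4


def is_candidate(repo: RepoActivity) -> bool:
    """Return True if the repository meets all three activity thresholds."""
    return (
        repo.pushes >= MIN_PUSHES
        and repo.issues >= MIN_ISSUES
        and repo.contributors >= MIN_CONTRIBUTORS
    )


def sample_candidates(repos, sample_size, seed=0):
    """Filter to candidates, then draw a uniform sample without replacement.

    The seed is an assumption added here so the draw is repeatable.
    """
    candidates = [repo for repo in repos if is_candidate(repo)]
    rng = random.Random(seed)
    # Guard against asking for more repositories than there are candidates.
    k = min(sample_size, len(candidates))
    return rng.sample(candidates, k)
```

With the full candidate list, `sample_candidates(all_repos, 5000)` would correspond to the 5,000-repository draw described above, where `all_repos` is a placeholder for the aggregated GitHub Archive data.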
We manually labeled each language used by any repository on two independent axes: whether it is a programming language, and whether it is memory-safe.
We used two quantification models (as explained in our paper) to estimate the number of issues in each repository that were security bugs. Each model's results are in a separate dataset file (`repos_data_nn.csv` and `repos_data_rfcc.csv`).
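As a quick way to inspect the two dataset files before opening the notebooks, the sketch below prints the column names and row count of each. It assumes only that each file is a CSV with a header row and sits in the current directory; it does not assume any particular column names, and `summarize_csv` is a hypothetical helper, not part of the dataset.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        # fieldnames comes from the header row; it is None for an empty file.
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    for filename in ("repos_data_nn.csv", "repos_data_rfcc.csv"):
        if Path(filename).exists():
            columns, row_count = summarize_csv(filename)
            print(f"{filename}: {row_count} rows, columns: {columns}")
        else:
            print(f"{filename}: not found in the current directory")
```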
Usage notes
Our main analysis (as reported in our paper) is contained in the Jupyter notebook `Regression.ipynb`. To run it, you need a running Jupyter instance with the R kernel installed. We ran these analyses using R version 3.3.1 and the `ggplot2`, `reshape2`, `plyr`, `car`, `tibble`, and `ggfortify` packages. Full system details are available at the bottom of each notebook.
Additionally, we include the extracted pure R versions of each notebook, as well as pre-rendered static HTML versions. These can be viewed without installing Jupyter or R.