King, Rachael; Roland, Edwin (2021), Epistolary Pamphlets, Dryad, Dataset, https://doi.org/10.25349/D90K72
These spreadsheets show the results of an effort to identify the epistolary pamphlet genre within the ECCO-TCP database. Sheet 1 shows the model's predictions of whether a particular text was an epistolary pamphlet while Sheet 2 identifies the tokens used to build the model. Overall, the model was able to identify the genre with 76 percent accuracy.
This statistical analysis uses an L2-regularized logistic regression with feature subsetting. The optimal number of features to include in the model (n = 1000) and the optimal regularization constant (C = 0.0001) were learned by a grid search with five-fold cross-validation. The predictions and coefficients reported in the paper were estimated by bootstrap. Because the number of texts in the training corpus was small (69 epistolary pamphlets by 39 authors), the results reported in the paper were subjected to strenuous testing for validity and generalizability. To ensure that results generalize beyond the small number of authors considered, the number of texts by each author was downsampled and cross-validation was performed by folding over authors rather than texts, so that no author's texts appear in both the training and the held-out portion of a fold. To ensure the validity of the findings, the statistical significance of all reported values was measured conservatively. While measures of significance would ideally be determined directly by bootstrap, that method would require an infeasible amount of computation time and memory for this study. Instead, only values whose bootstrapped distributions were approximately normal were considered, since their significance levels could be estimated by z-score. The threshold for significance was set at p = 0.05 with the Bonferroni adjustment.
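The safeguards described above (downsampling per author, folding over authors, and a Bonferroni-adjusted z-score test) can be sketched as follows. This is a minimal illustration rather than the code used to produce the dataset: the function names, the per-author cap, the random seeds, and the choice to test each bootstrapped value against zero are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import NormalDist, mean, stdev


def downsample_by_author(authors, max_per_author, seed=0):
    """Return sorted text indices, keeping at most `max_per_author` per author.

    `authors` holds one author label per text (hypothetical input format).
    """
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for idx, author in enumerate(authors):
        by_author[author].append(idx)
    kept = []
    for author in sorted(by_author):
        idxs = by_author[author]
        if len(idxs) > max_per_author:
            idxs = rng.sample(idxs, max_per_author)
        kept.extend(idxs)
    return sorted(kept)


def author_folds(authors, n_folds=5, seed=0):
    """Assign a fold index to each text so that no author spans two folds."""
    unique = sorted(set(authors))
    random.Random(seed).shuffle(unique)
    fold_of_author = {a: i % n_folds for i, a in enumerate(unique)}
    return [fold_of_author[a] for a in authors]


def bonferroni_significant(boot_values, n_tests, alpha=0.05):
    """Two-sided z-test of a bootstrapped value against zero (assumed null).

    Assumes the bootstrap distribution is approximately normal, so the
    bootstrap mean divided by the bootstrap standard deviation is treated
    as a z-score. The threshold is alpha / n_tests (Bonferroni adjustment).
    Returns (is_significant, p_value).
    """
    z = mean(boot_values) / stdev(boot_values)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha / n_tests, p
```

With 1000 features tested at once, for instance, the adjusted threshold would be 0.05 / 1000 = 0.00005, which is why only strongly separated coefficients survive the test.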