English Catalogue of Books dataset, 1912-1922
Data files
Feb 18, 2026 version (24.02 MB total)
- english-catalogue-of-books_1912-1922.csv (24.02 MB)
- README.md (2.39 KB)
Abstract
The English Catalogue of Books (1912-1922) dataset records publishing information for over 99,400 titles issued in England and Ireland between 1912 and 1922, as reported by publishers and recorded in editions of the English Catalogue of Books (ECB) issued by the trade publication Publishers' Circular between 1913 and 1923. Our data is based on OCR-generated text drawn from facsimiles of copies of the ECB held at Princeton University Libraries (1912–1918, 1920, 1922) and the New York Public Library (1919 and 1921), made available through HathiTrust’s Digital Library. The dataset includes the following categories of information for each work (where available in the original catalogues): author name, title, book format, publisher, month of publication, and year of publication.
Dataset DOI: 10.5061/dryad.rr4xgxdn3
Description of the data and file structure
Files and variables
Data File:
english-catalogue-of-books_1912-1922.csv
Description: UTF-8 encoded. The file is provided as a comma-delimited CSV and follows standard CSV quoting conventions: fields containing commas are enclosed in double quotation marks ("), and embedded quotation marks within fields are represented by doubled quotation marks (""). No separate escape character is used.
Cells containing "N/A" indicate that a value is not available for that entry. This may mean:
- the information was not recorded in the original English Catalogue of Books, or
- the information was present in the source catalogue but could not be successfully extracted from OCR-generated text during parsing.
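The quoting and missing-value conventions described above can be handled with Python's standard csv module. The following is a minimal sketch, not part of the published code: the sample rows are invented placeholders that use only a subset of the file's columns, and the helper name read_rows is hypothetical.

```python
import csv
import io

# Invented placeholder rows in the documented layout (not real catalogue entries).
# The author field contains a comma; the title contains doubled quotation marks.
SAMPLE = (
    'index,author(s),title,publisher\n'
    '0,"Author, Example","An ""Example"" Title",N/A\n'
)


def read_rows(handle):
    """Yield each row as a dict, mapping the literal string "N/A" to None."""
    for row in csv.DictReader(handle):
        yield {key: (None if value == "N/A" else value) for key, value in row.items()}


rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0])
```

To read the actual file, the same function could be given a handle opened with open("english-catalogue-of-books_1912-1922.csv", encoding="utf-8", newline="").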
Variables
- index: index of entries
- original_entry: complete, unparsed entry drawn from OCR-generated text in catalogues
- author(s): author of text
- title: title of text
- format: format of text (folio, 4to., 8vo., etc.)
- publisher: publisher of text
- date: date of publication of text (month, year)
- catalogue_year: issue of the English Catalogue of Books from which the entry is drawn
- page_num: page number (as printed in ECB) from which the entry is drawn
- doc_page_num: page number in the digital facsimile from which the entry is drawn
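As one illustration of how these variables might be used, the sketch below counts entries per catalogue_year and tallies "N/A" cells per column. It is an assumption-laden example rather than the authors' workflow: the sample rows are invented placeholders, the exact form of the catalogue_year and page values is assumed, and the function name summarize is hypothetical.

```python
import csv
import io
from collections import Counter

# Column names as listed under Variables.
COLUMNS = ["index", "original_entry", "author(s)", "title", "format",
           "publisher", "date", "catalogue_year", "page_num", "doc_page_num"]

# Invented placeholder rows; the year and page values are assumed formats.
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    '0,"placeholder entry one",Example Author,Example Title,8vo.,Example Press,N/A,1912,1,10\n'
    '1,"placeholder entry two",N/A,Another Title,N/A,Example Press,N/A,1913,2,11\n'
)


def summarize(handle):
    """Return (row count, entries per catalogue_year, "N/A" count per column)."""
    per_year = Counter()
    missing = Counter()
    total = 0
    for row in csv.DictReader(handle):
        total += 1
        per_year[row["catalogue_year"]] += 1
        for column in COLUMNS:
            if row[column] == "N/A":
                missing[column] += 1
    return total, per_year, missing


total, per_year, missing = summarize(io.StringIO(SAMPLE))
for column in COLUMNS:
    print(f"{column}: {missing[column]} of {total} missing")
```

Because the function reads one row at a time, it should also work on the full 24 MB file without loading it into memory at once.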
Code:
english-catalogue-of-books-1912-1922-1.zip
Description: Python code for ECB entry extraction and parsing.
File structure:
ecb_ocr_text/    Raw OCR text files (1902–1922)
entries/
  extracted_entries/    Regex-extracted entries from OCR text (1902–1922)
  hand_corrected_entries/    Manually reviewed entries (1912–1922)
  parsed_dataframes/    LLM-parsed entries with structured fields (1912–1922)
scripts/
  create_entries.py    Extract entries from OCR text
  llm_parser.py    Parse entries into structured fields via Google Gemini
  ai_output_accuracy_check.ipynb    Quality check on LLM output
  splitters.txt    Year-specific regex patterns for entry extraction
Access information
Data was derived from the following source:
