Frequencies per million words for 5 epidemiologically relevant search terms in a dozen British 19th century newspapers

Cite this dataset

Gatherer, Derek (2022). Frequencies per million words for 5 epidemiologically relevant search terms in a dozen British 19th century newspapers [Dataset]. Dryad.


COVID-19 is the first known coronavirus pandemic.  Nevertheless, the seasonal circulation of the four milder coronaviruses of humans – OC43, NL63, 229E and HKU1 – raises the possibility that these viruses are the descendants of more ancient coronavirus pandemics.  This proposal arises by analogy to the observed descent of seasonal influenza subtypes H2N2 (now extinct), H3N2 and H1N1 from the pandemic strains of 1957, 1968 and 2009, respectively.  Recent historical revisionist speculation has focussed on the influenza pandemic of 1889-1892, based on molecular phylogenetic reconstructions that show the emergence of human coronavirus OC43 around that time, probably by zoonosis from cattle.  If the “Russian influenza”, as The Times named it in early 1890, was not influenza but caused by a coronavirus, the origins of the other three milder human coronaviruses may also have left a residue of clinical evidence in the 19th century medical literature and popular press.  In this paper, we search digitised 19th century British newspapers for evidence of previously unsuspected coronavirus pandemics.  We conclude that there is little or no corpus linguistic signal in the UK national press for large-scale outbreaks of unidentified respiratory disease for the period 1785 to 1890.


The data file is a spreadsheet used to record queries made via CQPweb.

Search Terms

For clarity, in the ensuing descriptions, we use bold font for search terms and italic font for collocates and other quotations.

Based on clinical descriptions of COVID-19 (reviewed by Cevik et al., 2020), we identified the following search terms: 1) “cough”, 2) “fever”, 3) “pneumonia”.  To avoid confusion with years when influenza pandemics may have occurred, we added 4) “influenza” and 5) “epidemic”.  Any combination of terms 1 to 3 co-occurring with term 4 alone, or with terms 4 and 5 together, would be indicative of a respiratory outbreak caused by, or at least attributed to, influenza.  By contrast, any combination of terms 1 to 3 co-occurring with term 5 alone, or without either of terms 4 and 5, would suggest a respiratory disease that was not confidently identified as influenza at the time.  Such an outbreak would provide a candidate coronavirus epidemic for further investigation.
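The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as stated in the text, not code used to produce the dataset; the function name, the label strings and the idea of passing in a set of elevated terms per newspaper-year are all assumptions made for the example.

```python
# Terms 1 to 3 from the text; terms 4 and 5 are "influenza" and "epidemic".
RESPIRATORY_TERMS = {"cough", "fever", "pneumonia"}


def classify_year(elevated_terms):
    """Classify one newspaper-year from its set of elevated search terms.

    Hypothetical helper: `elevated_terms` is assumed to hold the search terms
    that were significantly elevated in that year.
    """
    elevated = set(elevated_terms)
    if not (elevated & RESPIRATORY_TERMS):
        # None of terms 1 to 3 is elevated, so the rule does not apply.
        return "no respiratory signal"
    if "influenza" in elevated:
        # Term 4 present, with or without term 5.
        return "attributed to influenza"
    # Terms 1 to 3 with term 5 alone, or with neither term 4 nor term 5.
    return "candidate for further investigation"


print(classify_year({"cough", "influenza", "epidemic"}))
print(classify_year({"fever", "epidemic"}))
```

Under this reading, a year in which only “influenza” and “epidemic” are elevated is not classified as a respiratory signal at all, because none of terms 1 to 3 is present.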


Newspapers and years searched were as follows: Belfast Newsletter (1828-1900), The Era (1838-1900), Glasgow Herald (1820-1900), Hampshire & Portsmouth Telegraph (1799-1900), Ipswich Journal (1800-1900), Liverpool Mercury (1811-1900), Northern Echo (1870-1900), Pall Mall Gazette (1865-1900), Reynold’s Daily (1850-1900), Western Mail (1869-1900) and The Times (1785-2009). The search in The Times was extended to 2009 in order to provide a comparison with the 20th century.

Searches were performed using Lancaster University’s instance of the CQPweb (Corpus Query Processor) corpus analysis software (Hardie, 2012). CQPweb’s database is populated from the newspapers listed, using optical character recognition (OCR), so for older publications in particular, some errors may be present (McEnery et al., 2019).


The occurrence of each of the five search terms was calculated per million words within the annual output of each publication, in CQPweb.  This was compared to a background distribution constituting the corresponding words per million for each search term over the total year range for each newspaper.  Within the annual distributions, for each search term and each newspaper, we determined the years lying in the top 1% (i.e. p<0.05 after application of a Bonferroni correction), following Gabrielatos et al. (2012).  These are deemed to be years when that search term was in statistically significant usage above its background level for the newspaper in which it occurs.  For years when search terms were significantly elevated, we also calculated collocates at range n.  Collocates, in corpus linguistics, are other words found at statistically significant usage, over their own background levels, in a window from n positions to the left to n positions to the right of the search term.  In other words, they are found in significant proximity to the search term.  A default value of n=10 was used throughout, unless otherwise specified.  Collocation analysis therefore helps to show how a search term associates with other words within a corpus, providing information about the context in which that search term is used.  CQPweb provides a log ratio method for quantifying the strength of collocation.
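For readers who want to see the arithmetic, the following Python sketch illustrates the three steps described above: normalisation to frequency per million words, selection of elevated years, and extraction of a collocation window. It is not the CQPweb implementation, and several points are assumptions made for the example: all function names are hypothetical, the Bonferroni divisor is taken to be the five search terms (0.05 / 5 = 0.01, which is one reading of the “top 1%” threshold), and the top 1% is located with a normal approximation (mean plus about 2.33 standard deviations), which the text does not specify. The log ratio scoring of collocates is performed within CQPweb and is not reproduced here.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev


def per_million(hits, total_words):
    """Frequency of a search term per million words of annual output."""
    return hits / total_words * 1_000_000


def bonferroni_alpha(alpha=0.05, n_tests=5):
    """Assumed correction: 0.05 divided by the five search terms gives 0.01."""
    return alpha / n_tests


def elevated_years(freq_by_year, alpha=0.01):
    """Years whose frequency lies in the upper `alpha` tail of the background.

    Assumption: the tail is found with a normal approximation to the
    newspaper's own annual distribution for that search term.
    """
    values = list(freq_by_year.values())
    cutoff = mean(values) + NormalDist().inv_cdf(1 - alpha) * stdev(values)
    return sorted(year for year, freq in freq_by_year.items() if freq > cutoff)


def window_counts(tokens, node, n=10):
    """Count words found within n positions left or right of each node hit.

    Overlapping windows are counted separately in this simple sketch.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n])
    return counts


# Invented figures for illustration only, not values from the dataset.
background = {1800 + i: 10.0 for i in range(30)}
background[1815] = 100.0
print(elevated_years(background, alpha=bonferroni_alpha()))
```

The raw window counts returned by `window_counts` would still need to be compared against each word's own background frequency before a word could be called a collocate in the sense used above.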

Usage notes

To view data, open in Microsoft Excel.  To reproduce the data from scratch, a login is needed to CQPweb.  This is free of charge but requires authorization, which can be applied for at the CQPweb site.