Data from: COPD screening using time–frequency features of self-recorded respiratory sounds

Tena, Alberto; Juez-Garcia, Ivan; Benítez, Iván; Clariá, Francesc; González, Jessica; de Batlle, Jordi; Solsona, Francesc

Published Aug 11, 2025 on Dryad. https://doi.org/10.5061/dryad.v41ns1s8g

Data files

Aug 11, 2025 version files (1.03 MB total)

README.md (6.55 KB)
record_data.csv (1.02 MB)

Abstract

Chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide, with up to 70% of cases remaining undiagnosed. This paper proposes a COPD screening tool based on time-frequency representation features of self-recorded respiratory sounds. Respiratory sound samples (breath and cough sounds) were extracted from COPD and asymptomatic non-COPD volunteers using a large, scientific-purpose database. We analysed 39 time-frequency representation features of breath and cough sounds, combined with age, sex, and smoking status, using autoencoder neural networks and random forest algorithms. We compared the performance of different breath and cough random forest models built to detect COPD: based exclusively on sound features, based exclusively on sociodemographic characteristics, and based on both sound features and sociodemographic characteristics. Models including breathing features outperformed models based exclusively on sociodemographic characteristics. Specifically, the model combining sociodemographic characteristics and breathing features achieved an AUC, accuracy, sensitivity, and specificity of 0.901, 0.836, 0.871, and 0.761, respectively, in the test set, representing a substantial increase in AUC compared to the model based exclusively on sociodemographic characteristics (0.901 vs. 0.818). Our results suggest that a lightweight collection of the time-frequency representation features of self-recorded breathing sounds could effectively improve the predictive performance of COPD screening or case-finding questionnaires. COPD screening through self-recorded breathing sounds could be easily integrated as a low-cost first step in case-finding programs, potentially helping to mitigate COPD underdiagnosis.

Dataset DOI: 10.5061/dryad.v41ns1s8g

Description of the data and file structure

This dataset, record_data.csv, is a new collection of time-frequency features derived from cough and respiration sound samples. The original sound samples were obtained from the proprietary COVID-19 Sounds database, a resource developed by Professor Cecilia Mascolo and her team in the Department of Computer Science and Technology at the University of Cambridge. We were granted permission to compute these features and share the resulting dataset. While the original sound data and associated sociodemographic information cannot be shared here due to proprietary rights held by the University of Cambridge, the data included in this repository can be used freely. For additional information about the original COVID-19 Sounds database or to request access to the raw sound data, please refer to the provided reference [1] and contact the University of Cambridge directly.

The dataset is provided as a comma-separated values (CSV) file. Please note that any cells with a "null" value indicate missing data and were excluded from our analysis.
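As a minimal sketch of how the "null" cells might be handled when reading the file, the Python snippet below maps the literal string "null" to `None` and then keeps only complete rows. The helper names (`read_feature_rows`, `complete_rows`) and the in-memory sample are hypothetical and only for illustration. Dropping whole rows is one possible reading of "excluded from our analysis", not necessarily the procedure used in the associated publication.

```python
import csv
import io


def read_feature_rows(handle):
    """Yield each CSV row as a dict, mapping the literal string "null" to None."""
    for row in csv.DictReader(handle):
        yield {name: (None if value == "null" else value)
               for name, value in row.items()}


def complete_rows(rows):
    """Keep only the rows that contain no missing value (assumed exclusion rule)."""
    return [row for row in rows if None not in row.values()]


# Tiny in-memory stand-in for record_data.csv; the values are made up.
sample = io.StringIO("H_t,H_f,k\n1.5,2.0,3.1\nnull,2.2,2.9\n")
rows = list(read_feature_rows(sample))
kept = complete_rows(rows)
```

To read the real file, the `io.StringIO` object can be replaced by `open("record_data.csv", newline="")`.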

[1] Tong Xia^, Dimitris Spathis^, Chloë Brown*, Jagmohan Chauhan*, Andreas Grammenos*, Jing Han*, Apinan Hasthanasombat*, Erika Bondareva*, Ting Dang*, Andres Floto, Pietro Cicuta, Cecilia Mascolo. "COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening." Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS 2021).
^joint first authors, *equal contribution, alphabetical order

Files and variables

File: record_data.csv

Variables

f_Cres1: average of the instantaneous frequency peak, calculated in the 0-80 Hz frequency band.
f_Cres2: average of the instantaneous frequency peak, calculated in the 80-250 Hz frequency band.
f_Cres3: average of the instantaneous frequency peak, calculated in the 250-550 Hz frequency band.
f_Cres4: average of the instantaneous frequency peak, calculated in the 550-900 Hz frequency band.
f_Cres5: average of the instantaneous frequency peak, calculated in the 900-1500 Hz frequency band.
f_Cres6: average of the instantaneous frequency peak, calculated in the 1500-3000 Hz frequency band.
f_Cres7: average of the instantaneous frequency peak, calculated in the 3000-4000 Hz frequency band.
Enr_Bn1: average of the instantaneous spectral energy of each sample, calculated in the 0-80 Hz frequency band.
Enr_Bn2: average of the instantaneous spectral energy of each sample, calculated in the 80-250 Hz frequency band.
Enr_Bn3: average of the instantaneous spectral energy of each sample, calculated in the 250-550 Hz frequency band.
Enr_Bn4: average of the instantaneous spectral energy of each sample, calculated in the 550-900 Hz frequency band.
Enr_Bn5: average of the instantaneous spectral energy of each sample, calculated in the 900-1500 Hz frequency band.
Enr_Bn6: average of the instantaneous spectral energy of each sample, calculated in the 1500-3000 Hz frequency band.
Enr_Bn7: average of the instantaneous spectral energy of each sample, calculated in the 3000-4000 Hz frequency band.
f_Med1: average of the instantaneous frequency, calculated in the 0-80 Hz frequency band.
f_Med2: average of the instantaneous frequency, calculated in the 80-250 Hz frequency band.
f_Med3: average of the instantaneous frequency, calculated in the 250-550 Hz frequency band.
f_Med4: average of the instantaneous frequency, calculated in the 550-900 Hz frequency band.
f_Med5: average of the instantaneous frequency, calculated in the 900-1500 Hz frequency band.
f_Med6: average of the instantaneous frequency, calculated in the 1500-3000 Hz frequency band.
f_Med7: average of the instantaneous frequency, calculated in the 3000-4000 Hz frequency band.
IE_Bn1: average of the spectral information, calculated in the 0-80 Hz frequency band.
IE_Bn2: average of the spectral information, calculated in the 80-250 Hz frequency band.
IE_Bn3: average of the spectral information, calculated in the 250-550 Hz frequency band.
IE_Bn4: average of the spectral information, calculated in the 550-900 Hz frequency band.
IE_Bn5: average of the spectral information, calculated in the 900-1500 Hz frequency band.
IE_Bn6: average of the spectral information, calculated in the 1500-3000 Hz frequency band.
IE_Bn7: average of the spectral information, calculated in the 3000-4000 Hz frequency band.
H_tf: joint Shannon entropy in a range of 0 to 20 bits.
H_t: instantaneous entropy.
H_f: spectral entropy.
fm: average of the instantaneous frequencies across the entire spectrum.
k: kurtosis.
MomC_11: joint time-frequency moments (n = 1 and m = 1).
MomC_77: joint time-frequency moments (n = 7 and m = 7).
MomC_1515: joint time-frequency moments (n = 15 and m = 15).
MomM_11: joint moments of the marginal signals of the instantaneous power and spectral density (n = 1 and m = 1).
MomM_77: joint moments of the marginal signals of the instantaneous power and spectral density (n = 7 and m = 7).
MomM_1515: joint moments of the marginal signals of the instantaneous power and spectral density (n = 15 and m = 15).
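The naming scheme of the variables above can be reproduced programmatically, which may help when checking the columns of a loaded file. The sketch below rebuilds the 39 feature names and maps each per-band feature to its frequency band. It assumes the CSV column names match the variable names listed here exactly, and the constants and functions (`BANDS_HZ`, `feature_names`, `band_of`) are hypothetical helpers, not part of the dataset.

```python
# Frequency bands (Hz) shared by the four per-band feature families.
BANDS_HZ = [(0, 80), (80, 250), (250, 550), (550, 900),
            (900, 1500), (1500, 3000), (3000, 4000)]

# Prefixes of the per-band features; each takes a suffix 1..7.
BAND_PREFIXES = ["f_Cres", "Enr_Bn", "f_Med", "IE_Bn"]

# Features computed over the whole signal or spectrum.
GLOBAL_FEATURES = ["H_tf", "H_t", "H_f", "fm", "k"]

# Moment features: the order is written twice (n and m), e.g. MomC_1515.
MOMENT_FEATURES = [f"{prefix}_{order}{order}"
                   for prefix in ("MomC", "MomM")
                   for order in (1, 7, 15)]


def feature_names():
    """Return the feature names in the order they are listed above."""
    banded = [f"{prefix}{index}"
              for prefix in BAND_PREFIXES
              for index in range(1, len(BANDS_HZ) + 1)]
    return banded + GLOBAL_FEATURES + MOMENT_FEATURES


def band_of(name):
    """Return (low_hz, high_hz) for a per-band feature name, else None."""
    for prefix in BAND_PREFIXES:
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            index = int(suffix)
            if 1 <= index <= len(BANDS_HZ):
                return BANDS_HZ[index - 1]
    return None
```

For example, `band_of("Enr_Bn3")` gives the 250-550 Hz band, while global features such as `H_tf` have no band and return `None`.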

Further details of the time-frequency features in the dataset can be found in the Online Supplementary file of the associated publication titled "COPD Screening Using Time–Frequency Features of Self-Recorded Respiratory Sounds" published in JAMIA Open.

Access information

Data was derived from the following sources:

Tong Xia^, Dimitris Spathis^, Chloë Brown*, Jagmohan Chauhan*, Andreas Grammenos*, Jing Han*, Apinan Hasthanasombat*, Erika Bondareva*, Ting Dang*, Andres Floto, Pietro Cicuta, Cecilia Mascolo. "COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening." Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS 2021).
^joint first authors, *equal contribution, alphabetical order

Human subjects data

All human subject data included in this dataset have been de-identified in accordance with relevant privacy regulations. Identifying information has been removed or masked. No data in this dataset can reasonably be used to identify individuals.