Data from: A cost-effective blood DNA methylation-based age estimation method in domestic cats, Tsushima leopard cats (Prionailurus bengalensis euptilurus), and Panthera species, using targeted bisulfite sequencing and machine learning models
Data files
Jan 02, 2024 version files 1.87 MB
Abstract
Knowledge of individual age can help both in-situ and ex-situ conservation programs to design more efficient and suitable management plans for targeted wildlife species. DNA methylation is an epigenetic aging marker that has emerged as a promising tool for estimating age with high accuracy from only a tiny amount of biological material, which can be collected in a minimally invasive way. Here, we sequenced five targeted genetic regions and used 8–23 selected CpG sites to build age estimation models with machine learning methods at a cost of only about $3–7 per sample, using blood samples of seven Felidae species, ranging from small to large and from domestic to endangered: domestic cats (Felis catus, 139 samples), Tsushima leopard cats (Prionailurus bengalensis euptilurus, 84 samples), and five Panthera species (96 samples). The models achieved satisfactory accuracy: the mean absolute error of the best models was 1.966, 1.348, and 1.552 years in domestic cats, Tsushima leopard cats, and Panthera spp., respectively. Our models in domestic cats and Tsushima leopard cats were applicable to individuals regardless of health condition, indicating that they can be applied to samples collected in diverse situations, e.g., rescued individuals in the context of conservation. We also showed the possibility of developing universal age estimation models for the five Panthera spp. using two of the five genetic regions, suggesting an even lower cost for future applications.
README: Datasets
Appendix S1–S4
The files include the methylation data, sample information, and predicted age for each target species/species group. The data in these files were used to build the age estimation models. In the filenames, 'domestic cat' indicates the domestic cat; 'leopard cat' indicates the Tsushima leopard cat; 'panthera' indicates the Panthera species (i.e., jaguar, leopard, lion, snow leopard, and tiger); and 'all' indicates all samples from all species.
Appendix S5
The file contains the CpG selection results for the best age estimation model of each species/species group and, for each CpG site, how often it was selected in elastic net feature selection, the correlation coefficient between its methylation rate and chronological age, and its NCBI sequence ID with position.
CpG No renamed fulllist_all felidae.csv
This file lists the CpGs that were present in at least one species.
M%+sampleinfo*.csv
These files are the versions of Appendix S1–S4 before the predicted ages were added.
indextable_skf_cor*.csv
Raw results of feature selection (correlation-based).
indextable_skf_loio_ela*.csv
Raw results of feature selection (elastic net-based, leave-one-individual-out cross-validation).
indextable_skf_loso(_raw)_ela*.csv
Raw results of feature selection (elastic net-based, leave-one-species-out cross-validation).
P.S. Appendix S1–S5 are referred to in our paper. The other files were used only in the analysis.
Description of the data sets and file structures
Appendix S1–S4, M%+sampleinfo*.csv
- Column headers beginning with amp3_, amp4_, amp8_, amp9_, and bs38_ are the names of CpG sites. These columns contain the methylation rates. The proximal genes and genomic positions of the CpG sites are listed in Appendix S5 and CpG No renamed fulllist_all felidae.csv.
- Health_condition_ed: health condition at the time of sampling (good, diseased).
- Health_condition (Appendix S2–S4, species other than domestic cats): raw health condition data
- Health condition information in Appendix S1 (domestic cats):
  - Health_condition_Healthy (column K): healthy sample
  - Health_condition_CKD (column L): sample with chronic kidney disease
  - Health_condition_Diabetes (column M): sample with diabetes
  - Health_condition_Cancer (column N): sample with cancer
  - Health_condition_DigestiveDisease (column O): sample with digestive diseases
  - Health_condition_Others (column P): sample with other diseases
- Fold: data was split into five folds (0–4) with similar age and species distribution using stratified k-fold.
- Age_class: age class of each sample.
- Predictedage_*: age predicted through the methods below.
Feature selection methods | Regression methods | Column name (after 'Predictedage_') |
---|---|---|
elastic net (applied only once, for both steps) | elastic net (same single run) | ela |
elastic net | elastic net | ela_ela |
elastic net | SVMr | ela_svmr |
cor ≥ 0.5 | elastic net | cor0_5_ela |
cor ≥ 0.7 | elastic net | cor0_7_ela |
cor ≥ 0.5 | SVMr | cor0_5_svmr |
cor ≥ 0.7 | SVMr | cor0_7_svmr |
- For Appendix S2 and M%+sampleinfo_leopardcat_paper_final_fold+ageclass.csv
  - 'Age_stage_at_time_of_protection' shows the age stage of each individual, estimated by morphological methods at the time it was taken into protection.
  - 'Death_date' shows the date of death. An empty cell means the individual was still alive in 2023. These data were not used in the analysis.
  - Empty cells mean no data. Captive-born individuals have no data in 'Age_stage_at_time_of_protection'. Wild-born individuals have no data in 'Age', 'Health_condition_ed', 'Fold', and 'Age_class', which are available only for captive-born individuals of known age. The predicted epigenetic age was calculated using only the best model and is given in 'Predictedage_ela_svmr'.
- For Appendix S3 and M%+sampleinfo_panthera_paper_final_fold+ageclass.csv, Appendix S4 and M%+sampleinfo_all_paper_final_fold+relative_ageclass.csv
  - 'Predictedage_*_loso(_raw)' is the age predicted when the model was evaluated by leave-one-species-out cross-validation.
- For Appendix S4
  - 'Predictedage_*' is the predicted relative age of each sample. 'Predictedage_*_chronoloical age' is the predicted chronological age under the best models.
  - Empty cells mean no data. Health conditions were summarized differently for domestic cats and for the other species, so some health condition-related columns contain empty cells.
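The column descriptions above can be illustrated with a minimal sketch that computes the mean absolute error (MAE) overall and per fold from the 'Age', 'Fold', and 'Predictedage_ela_svmr' columns, skipping rows with empty cells. The rows below are synthetic and are not values from the dataset, the function name `mae_by_fold` is hypothetical, and this is not the code in the authors' scripts.

```python
import csv
import io
from collections import defaultdict

# Synthetic rows for illustration only; these are NOT values from the dataset.
SAMPLE_CSV = """Age,Fold,Predictedage_ela_svmr
2.0,0,2.5
10.0,0,9.0
5.0,1,5.5
,,4.2
7.0,1,6.0
"""


def mae_by_fold(csv_text, pred_col="Predictedage_ela_svmr"):
    """Return (overall MAE, {fold: MAE}) over rows with a known age, a fold, and a prediction."""
    errors = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Empty cells mean no data (e.g. individuals whose age is unknown).
        if not row["Age"] or not row["Fold"] or not row[pred_col]:
            continue
        fold = int(float(row["Fold"]))  # tolerate "0" as well as "0.0"
        errors[fold].append(abs(float(row[pred_col]) - float(row["Age"])))
    all_errors = [e for errs in errors.values() for e in errs]
    overall = sum(all_errors) / len(all_errors)
    per_fold = {fold: sum(errs) / len(errs) for fold, errs in sorted(errors.items())}
    return overall, per_fold


overall_mae, fold_mae = mae_by_fold(SAMPLE_CSV)
print(overall_mae, fold_mae)
```

To try this on a real file, the text of one of the Appendix CSV files could be passed in place of `SAMPLE_CSV`, with `pred_col` set to any of the 'Predictedage_*' columns that file contains.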
Appendix S5, CpG No renamed fulllist_all felidae.csv
- Columns E to M show whether each CpG site exists in each species or species group: 0 means the CpG does not exist in the species; 1 means it does. Panthera_spp. (column L) covers the species in columns G–K (i.e., jaguar, leopard, lion, snow leopard, and tiger). All_spp. (column M) covers all species.
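As a rough illustration of how the 0/1 presence columns might be used, the sketch below lists the CpG sites flagged as existing in a given species group. The 'Panthera_spp.' and 'All_spp.' column names come from the description above; the 'CpG' header, the CpG names, the flag values, and the function name `cpgs_present` are hypothetical placeholders, since the exact headers of the file are not given here.

```python
import csv
import io

# Synthetic presence table; the "CpG" header and all values are hypothetical.
PRESENCE_CSV = """CpG,Panthera_spp.,All_spp.
amp3_1,1,1
amp4_2,1,0
bs38_5,0,0
"""


def cpgs_present(csv_text, group_col):
    """Return the CpG names flagged 1 (exists) in the given species-group column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["CpG"] for row in reader if row[group_col].strip() == "1"]


panthera_cpgs = cpgs_present(PRESENCE_CSV, "Panthera_spp.")
all_cpgs = cpgs_present(PRESENCE_CSV, "All_spp.")
print(panthera_cpgs, all_cpgs)
```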
Appendix S5
- Green, yellow, orange, and red columns represent different levels of correlation coefficients between methylation rates of selected CpG sites and chronological age. White columns are CpG sites that were not selected. Grey columns are CpG sites that did not exist in the species group.
- Columns named "Features in the best model (correlation_coefficient)—Elastic net + SVMr (frequency ≥ 4 or 5)" show the correlation coefficient between chronological age and the methylation rates of the features (i.e., CpGs) used in the best models. Elastic net-based feature selection followed by regression using SVMr (Elastic net + SVMr) produced the best models for all species groups. For some species groups, the explanatory variables of the best model were the CpGs selected in at least four of the five training data sets (frequency ≥ 4); for the others, they were the CpGs selected in all five training data sets (frequency ≥ 5).
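The frequency thresholds described above can be sketched as a simple count over the five training data sets. This is only an illustration of the counting step, assuming the CpGs selected by elastic net in each training set are already known; the CpG names, the per-fold selections, and the function name `features_by_frequency` are hypothetical, and the elastic net fitting itself is not shown.

```python
from collections import Counter

# Hypothetical CpGs selected by elastic net in each of the five training data sets.
selected_per_fold = [
    {"amp3_1", "amp4_2", "bs38_5"},
    {"amp3_1", "amp4_2"},
    {"amp3_1", "amp4_2", "amp9_7"},
    {"amp3_1", "bs38_5"},
    {"amp3_1", "amp4_2"},
]


def features_by_frequency(fold_selections, min_frequency):
    """Return the CpGs selected in at least `min_frequency` of the training sets."""
    counts = Counter(cpg for fold in fold_selections for cpg in fold)
    return sorted(cpg for cpg, n in counts.items() if n >= min_frequency)


freq4 = features_by_frequency(selected_per_fold, 4)  # selected in >= 4 of 5 sets
freq5 = features_by_frequency(selected_per_fold, 5)  # selected in all 5 sets
print(freq4, freq5)
```

A threshold of 5 is the stricter choice and yields a subset of the features kept at a threshold of 4.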
Code/Software
2023_Qi_etal_paper Rscript.R was run in R 4.3.1.
2023_Qi_etal_Pythonscript.py was run in Python 3.8.8.