AI and paleontology: Effects of vertebrate fossil sample size on machine learning image classification
Data files
Jan 30, 2026 version files 1.04 MB
- 1_README_1.29.2026.pdf (184.39 KB)
- 2_Complete_study_image_data_set.pdf (96.55 KB)
- 3ab_Optimal_pixel_density_context_and_statistical_results.pdf (187.50 KB)
- 3c_Optimal_pixel_density_ML_model_results.csv (479 B)
- 4_Original_Python-Tensorflow_files_including_model_code.pdf (98.74 KB)
- 5a_Optimal_performance_statistical_results.pdf (154.39 KB)
- 5b_Database_optimal_perf_ML_results_WO_fine_tuning.csv (1.85 KB)
- 5c_Database_optimal_perf_ML_results_W_fine_tuning.csv (1.88 KB)
- 5d_Variance_dataset.csv (508 B)
- 6a_Data_augmentation_statisical_results_.pdf (119.72 KB)
- 6b_Data_Augmentation-FT_model__RG-HF.csv (2.46 KB)
- 7_Misidentification_analysis.pdf (96.91 KB)
- 8_GBIF_plots_of_fossil_Lamnidae_and_Carcharhinidae.pdf (72.38 KB)
- 9_SharkAI_R_code_2025_sections_356.R (20.33 KB)
- README.md (1.34 KB)
Abstract
With the growing application of artificial intelligence (AI) and machine learning (ML), great potential exists to leverage these technologies in paleontology. Relative to many other scientific fields, a challenge of applying ML to paleontology is small sample sizes, particularly for fossil vertebrates. Shark teeth, abundant in the fossil record, provide a model system for testing ML across varying sample sizes. Here we use six classes (taxa) of Neogene shark teeth for taxonomic identification, using a curated dataset of 3150 images. Each class was evaluated using an 80% training and 20% validation split, with a separate, external test set of 25 samples per class. Pretrained models perform well (accuracy > 90%), providing a strong baseline for classification. However, enabling fine-tuning of the ML model to identify fossil shark teeth improves performance considerably. Sample size per class also affects the accuracy of the models’ classifications. Smaller sample sizes (n = 50 individuals per class) yielded a mean accuracy of 93.4%, whereas accuracy plateaued at ~99% between 200 and 500 images per class. Confidence likewise increases with larger samples, from 81.8% (n = 50 individuals per class) to >90% (n = 300 to 500 individuals per class). Misidentifications followed consistent patterns, reflecting morphological similarities and/or poor preservation. Artificially increasing the training datasets using data augmentation improves the confidence of identifications. This research indicates that relatively small samples of vertebrate species (~50 to 500 individuals per class) can effectively train an ML model to identify these shark teeth with high levels of accuracy.
Dataset DOI: 10.5061/dryad.zpc866tpq
Description of the data and file structure
A complete README file, formatted as a PDF, is included as 1_README_1.29.2026.pdf. The data, code, and document files are listed as follows:
- 3c_Optimal_pixel_density_ML_model_results.csv
- 5b_Database_optimal_perf_ML_results_WO_fine_tuning.csv
- 5c_Database_optimal_perf_ML_results_W_fine_tuning.csv
- 5d_Variance_dataset.csv
- 6b_Data_Augmentation-FT_model__RG-HF.csv
- 9_SharkAI_R_code_2025_sections_356.R
- 1_README_1.29.2026.pdf
- 2_Complete_study_image_data_set.pdf
- 3ab_Optimal_pixel_density_context_and_statistical_results.pdf
- 4_Original_Python-Tensorflow_files_including_model_code.pdf
- 5a_Optimal_performance_statistical_results.pdf
- 6a_Data_augmentation_statisical_results_.pdf
- 7_Misidentification_analysis.pdf
- 8_GBIF_plots_of_fossil_Lamnidae_and_Carcharhinidae.pdf
Files and variables
File and variable descriptions are provided in 1_README_1.29.2026.pdf and in the other files listed above.
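Because the variable definitions live in the README PDF, a quick first step after downloading is to list the columns and row counts of each CSV file. The following is a minimal sketch using only the Python standard library. The file names come from the file list above; the download folder (`data_dir`) and the assumption that the files are UTF-8 encoded are illustrative, and no column names are assumed.

```python
import csv
from pathlib import Path

# CSV file names as listed in this submission.
CSV_FILES = [
    "3c_Optimal_pixel_density_ML_model_results.csv",
    "5b_Database_optimal_perf_ML_results_WO_fine_tuning.csv",
    "5c_Database_optimal_perf_ML_results_W_fine_tuning.csv",
    "5d_Variance_dataset.csv",
    "6b_Data_Augmentation-FT_model__RG-HF.csv",
]


def summarize_csv(path):
    """Return (header, number_of_non_empty_data_rows) for one CSV file."""
    # Assumption: the file is UTF-8 (with or without a byte-order mark).
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        n_rows = sum(1 for row in reader if row)
    return header, n_rows


def summarize_submission(data_dir="."):
    """Summarize every listed CSV found in data_dir (a hypothetical download folder)."""
    summaries = {}
    for name in CSV_FILES:
        path = Path(data_dir) / name
        if path.exists():
            summaries[name] = summarize_csv(path)
    return summaries


if __name__ == "__main__":
    for name, (header, n_rows) in summarize_submission().items():
        print(f"{name}: {n_rows} rows, columns = {header}")
```

Files that are missing from the folder are skipped, so the script can be run on a partial download.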
Code/software
Code and software details are provided in 1_README_1.29.2026.pdf, listed above.
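For orientation, the 80% training and 20% validation split described in the abstract can be illustrated with a short standard-library sketch. This is not the authors' Python-TensorFlow code, which is documented in 4_Original_Python-Tensorflow_files_including_model_code.pdf; the function name, the integer image IDs, and the fixed seed are hypothetical.

```python
import random


def split_class(image_ids, train_fraction=0.8, seed=0):
    """Shuffle one class's image IDs and split them into training and validation lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


if __name__ == "__main__":
    # Per-class sample sizes mentioned in the abstract; IDs are placeholders.
    # The external test set of 25 samples per class is held out separately
    # and is not part of this split.
    for n in (50, 200, 300, 500):
        train, val = split_class(range(n))
        print(f"n = {n}: {len(train)} training, {len(val)} validation")
```

At n = 50 per class this leaves 40 training and 10 validation images, which gives a sense of how little data the smallest configuration works with.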
Access information
Access information is provided in 1_README_1.29.2026.pdf, listed above.
