Dryad

Advancing provenance assignment using machine learning and time series analysis of chemical chronologies in archival tissues

Data files

Jan 28, 2026 version files 327.36 KB

Abstract

Accurately assigning the provenance of organisms is critical for understanding ecological connectivity and guiding effective conservation and management. Natural chemical chronologies stored in metabolically inert, incrementally growing tissues (e.g., otoliths) provide a powerful tool for this purpose. However, traditional approaches face biological challenges (e.g., dispersal, maternal effects), collapse chronological data into summary metrics, and rely on subjective interpretation—limiting their accuracy and scalability. We present a novel, flexible framework that integrates machine learning, time-series analysis, and ensemble modeling to improve provenance assignment from archival tissue chemistry. Using otolith 87Sr/86Sr profiles from 17 natal sources of California Central Valley Chinook salmon (Oncorhynchus tshawytscha), we moved beyond conventional summary-based methods by developing fully automated time-series feature extraction, explicit time-series classification (including dynamic time warping [DTW] with k-nearest neighbors [KNN]), and ensemble models that combine multiple classifiers. We further incorporated simulated data to represent under-sampled life history strategies and validated the framework on real-world, known- and unknown-origin samples. Time-series-based approaches consistently outperformed traditional methods, particularly for sources with strong maternal signatures or early dispersal. Feature extraction approaches informed by biological knowledge were most effective when chemical chronologies followed predictable life-stage-specific patterns, whereas explicit time-series classification (DTW + KNN) excelled when sources displayed distinct overall profile “shapes.” Ensemble models leveraged the complementary strengths of individual approaches, outperforming any single method. 
Incorporating simulated data corrected systematic underrepresentation of key life history phenotypes in real-world applications, improving model performance, population composition estimates, and their relevance for management decisions. Our results highlight the power of treating archival chemical data as time series, combined with machine learning and ensemble strategies, to enhance the accuracy, consistency, and scalability of provenance assignment. This flexible framework is broadly applicable across taxa, tissues, and chemical markers, offering a practical roadmap for advancing ecological inference and informing conservation and management.
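
The explicit time-series classification step named in the abstract (DTW with KNN) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one-dimensional profiles of unequal length, an absolute-difference local cost, and a simple majority vote. The function names (`dtw_distance`, `knn_predict`), the source labels, and the isotope-ratio-like values are all hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Uses absolute difference as the local cost and no warping-window
    constraint (both are assumptions made for this sketch).
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def knn_predict(train_profiles, train_labels, query, k=1):
    """Return the majority label among the k training profiles
    closest to the query in DTW distance."""
    ranked = sorted(
        zip(train_profiles, train_labels),
        key=lambda pair: dtw_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up profiles for illustration only (NOT values from the dataset).
train_profiles = [
    [0.7040, 0.7042, 0.7060, 0.7061, 0.7061],          # low start, then rise
    [0.7040, 0.7041, 0.7059, 0.7060, 0.7060, 0.7061],  # same shape, longer
    [0.7080, 0.7079, 0.7065, 0.7064],                  # high start, then fall
    [0.7081, 0.7080, 0.7080, 0.7066, 0.7064],          # same shape, longer
]
train_labels = ["source_A", "source_A", "source_B", "source_B"]

# The query rises later than the source_A examples; warping absorbs the shift.
query = [0.7041, 0.7041, 0.7042, 0.7060, 0.7061]
predicted = knn_predict(train_profiles, train_labels, query, k=3)
print(predicted)
```

Because DTW aligns profiles by shape rather than by index, sequences of different lengths or with shifted transitions can still be compared directly. That property is why this kind of classifier suits sources with distinct overall profile shapes.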