Advancing provenance assignment using machine learning and time series analysis of chemical chronologies in archival tissues

Arai, Kohma¹; Willmes, Malte²; Johnson, Rachel¹,³; Sturrock, Anna⁴

Published Jan 28, 2026 on Dryad. https://doi.org/10.5061/dryad.3bk3j9kz3

Data files

Jan 28, 2026 version files (327.36 KB total)

01_cross_validation_2025-10-27.Rmd (50.40 KB)
02_unknown_juvenile_assignment_2025-10-27.Rmd (17.13 KB)
03_adult_assignment_2025-10-27.Rmd (29.57 KB)
04_natal_comparison_2025-10-27.Rmd (20.99 KB)
adult_data_2025-07-14.csv (16.61 KB)
chipps_juvenile_data_2025-07-14.csv (32.24 KB)
natal_data_2025-07-14.csv (154.32 KB)
README.md (6.10 KB)

Abstract

Accurately assigning the provenance of organisms is critical for understanding ecological connectivity and guiding effective conservation and management. Natural chemical chronologies stored in metabolically inert, incrementally growing tissues (e.g., otoliths) provide a powerful tool for this purpose. However, traditional approaches face biological challenges (e.g., dispersal, maternal effects), collapse chronological data into summary metrics, and rely on subjective interpretation—limiting their accuracy and scalability. We present a novel, flexible framework that integrates machine learning, time-series analysis, and ensemble modeling to improve provenance assignment from archival tissue chemistry. Using otolith ⁸⁷Sr/⁸⁶Sr profiles from 17 natal sources of California Central Valley Chinook salmon (Oncorhynchus tshawytscha), we moved beyond conventional summary-based methods by developing fully automated time-series feature extraction, explicit time-series classification (including dynamic time warping [DTW] with k-nearest neighbors [KNN]), and ensemble models that combine multiple classifiers. We further incorporated simulated data to represent under-sampled life history strategies and validated the framework on real-world, known- and unknown-origin samples. Time-series-based approaches consistently outperformed traditional methods, particularly for sources with strong maternal signatures or early dispersal. Feature extraction approaches informed by biological knowledge were most effective when chemical chronologies followed predictable life-stage-specific patterns, whereas explicit time-series classification (DTW + KNN) excelled when sources displayed distinct overall profile “shapes.” Ensemble models leveraged the complementary strengths of individual approaches, outperforming any single method. 
Incorporating simulated data corrected systematic underrepresentation of key life history phenotypes in real-world applications, improving model performance, population composition estimates, and their relevance for management decisions. Our results highlight the power of treating archival chemical data as time series, combined with machine learning and ensemble strategies, to enhance the accuracy, consistency, and scalability of provenance assignment. This flexible framework is broadly applicable across taxa, tissues, and chemical markers, offering a practical roadmap for advancing ecological inference and informing conservation and management.

Dataset DOI: 10.5061/dryad.3bk3j9kz3

Description of the data and file structure

This README was generated on 2025-10-27 by K Arai.

Author (Name, Institution, Email):
K. Arai, University of California Davis, kharai@ucdavis.edu

This README describes the data sets and R code (.Rmd) needed to recreate the figures and analyses in the manuscript “Advancing Provenance Assignment using Machine Learning and Time Series Analysis of Chemical Chronologies in Archival Tissues”, submitted by K Arai.

The manuscript contains four major analyses, each with its own script. The first script, 01_cross_validation_2025-10-27.Rmd, compares the performance of different assignment models through cross-validation. The second script, 02_unknown_juvenile_assignment_2025-10-27.Rmd, assigns natal origin to real-world, wild unknown-origin juveniles (n = 50) collected at Chipps Island from 2014 to 2021. The third script, 03_adult_assignment_2025-10-27.Rmd, uses known-origin adult samples to evaluate the benefits of including synthetic early migrant data in the training dataset. The fourth script, 04_natal_comparison_2025-10-27.Rmd, evaluates the subjectivity involved in user-defined natal region assignments and how that subjectivity might influence subsequent natal assignment results.
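To make the cross-validation idea concrete, here is a minimal sketch of leave-one-out cross-validation for a dynamic time warping (DTW) nearest-neighbor classifier, one of the model families named in the abstract. It is not the authors' implementation (their analyses are in R and are not reproduced here). It uses k = 1 for simplicity, the function names are hypothetical, and the toy profiles are invented for illustration rather than taken from the dataset.

```python
from math import inf


def dtw_distance(a, b):
    """Classic DTW distance between two sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def predict_1nn(train, query):
    """Label of the training profile closest to `query` under DTW (k = 1)."""
    return min(train, key=lambda item: dtw_distance(item[1], query))[0]


def leave_one_out_accuracy(samples):
    """Hold out each sample in turn, classify it from the rest, and score."""
    correct = 0
    for i, (label, profile) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if predict_1nn(train, profile) == label:
            correct += 1
    return correct / len(samples)


# Invented toy profiles (NOT values from the dataset): two sources whose
# isotope-ratio-like profiles differ in both level and shape.
toy = [
    ("source_A", [0.7050, 0.7050, 0.7052, 0.7060, 0.7061]),
    ("source_A", [0.7050, 0.7051, 0.7059, 0.7061, 0.7061]),
    ("source_B", [0.7080, 0.7079, 0.7075, 0.7070, 0.7070]),
    ("source_B", [0.7080, 0.7076, 0.7071, 0.7070, 0.7069]),
]
accuracy = leave_one_out_accuracy(toy)
print(f"Leave-one-out accuracy on toy data: {accuracy:.2f}")
```

With real data, each profile would be the row of splined values (X100 to X320) and each label the corresponding NAT_LOC; the scripts themselves remain the authoritative description of the models and validation scheme used in the manuscript.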

Data list

natal_data_2025-07-14.csv
chipps_juvenile_data_2025-07-14.csv
adult_data_2025-07-14.csv

Script list

01_cross_validation_2025-10-27.Rmd
02_unknown_juvenile_assignment_2025-10-27.Rmd
03_adult_assignment_2025-10-27.Rmd
04_natal_comparison_2025-10-27.Rmd

Files and variables

Known-origin juvenile Chinook salmon otolith ⁸⁷Sr/⁸⁶Sr reference data (`natal_data_2025-07-14.csv`)

This dataset contains otolith ⁸⁷Sr/⁸⁶Sr measurements from known-origin juvenile Chinook salmon, representing 17 distinct natal sources. It includes both empirical observations and simulated (synthetic) data for early migrants; the type variable distinguishes the two. Missing or unavailable values are indicated by NA.

Number of variables: 40
Number of cases/rows: 255
Variable list:

row_id: Unique row identifier (string)

random_id: Unique row number (integer)

Sample_ID: Sample identifier (string)

NAT_LOC: Natal source ID (categorical)

HvW: Hatchery or wild origin (categorical)

natal_start_rd1, natal_start_rd2, natal_start_rd3: Natal region start distance (µm from otolith core) as identified by reader 1, 2, and 3, respectively

med_sr_rd1, med_sr_rd2, med_sr_rd3: Median ⁸⁷Sr/⁸⁶Sr value within natal region identified by reader 1, 2, and 3, respectively

med_sr_auto: Median ⁸⁷Sr/⁸⁶Sr value within a fixed natal region (245–320 µm)

min_sr_auto, max_sr_auto, range_sr_auto: Minimum, maximum, and range of ⁸⁷Sr/⁸⁶Sr value within a fixed region (160–320 µm)

X100, X110, X120,…, X320: Splined ⁸⁷Sr/⁸⁶Sr value at each distance from the otolith core (10 µm intervals)

mean_exog: Mean exogenous feeding check score

type: Data type (empirical or synthetic)
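As a rough illustration of how the fixed-window summary variables relate to the splined columns, the sketch below recomputes a median over 245–320 µm and the minimum, maximum, and range over 160–320 µm from the X-prefixed columns of a single row. This is an assumption-laden approximation: the README does not say whether med_sr_auto and the related variables were derived from the splined 10 µm values or from the underlying raw measurements, so recomputed numbers may not match the supplied ones. The function names are hypothetical, and the demo row contains invented values.

```python
import csv
import io
from statistics import median


def window_values(row, lo_um, hi_um):
    """Splined values from X<distance> columns with lo_um <= distance <= hi_um."""
    values = []
    for name, raw in row.items():
        is_profile_column = name.startswith("X") and name[1:].isdigit()
        if is_profile_column and lo_um <= int(name[1:]) <= hi_um:
            if raw not in ("", "NA"):  # the README marks missing values as NA
                values.append(float(raw))
    return values


def summary_features(row):
    """Approximate the *_auto summaries from the splined profile of one row."""
    natal = window_values(row, 245, 320)  # fixed natal region
    wide = window_values(row, 160, 320)   # fixed region for min/max/range
    return {
        "med": median(natal) if natal else None,
        "min": min(wide) if wide else None,
        "max": max(wide) if wide else None,
        "range": (max(wide) - min(wide)) if wide else None,
    }


# Invented one-row table with only a few distance columns, for demonstration.
demo_csv = io.StringIO(
    "Sample_ID,X150,X160,X250,X260,X320\n"
    "toy_1,0.7040,0.7050,0.7060,0.7070,NA\n"
)
demo_row = next(csv.DictReader(demo_csv))
features = summary_features(demo_row)
print(features)
```

To apply the same functions to the real file, open natal_data_2025-07-14.csv with `csv.DictReader` and pass each row to `summary_features`, then compare against the supplied med_sr_auto, min_sr_auto, max_sr_auto, and range_sr_auto columns.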

Wild unknown-origin juvenile Chinook salmon otolith ⁸⁷Sr/⁸⁶Sr data (`chipps_juvenile_data_2025-07-14.csv`)

This dataset contains otolith ⁸⁷Sr/⁸⁶Sr measurements and metadata for wild, unknown-origin juvenile Chinook salmon collected at Chipps Island from 2014 to 2021. Missing or unavailable values are indicated by NA.

Number of variables: 44
Number of cases/rows: 50
Variable list:

row_id: Unique row identifier (string)

random_id: Unique row number (integer)

Sample_ID: Sample identifier (string)

NAT_LOC: Natal source ID (categorical)

HvW: Hatchery or wild origin (categorical)

natal_start_rd1, natal_start_rd2, natal_start_rd3: Natal region start distance (µm from otolith core) as identified by reader 1, 2, and 3, respectively

med_sr_rd1, med_sr_rd2, med_sr_rd3: Median ⁸⁷Sr/⁸⁶Sr value within natal region identified by reader 1, 2, and 3, respectively

med_sr_auto: Median ⁸⁷Sr/⁸⁶Sr value within a fixed natal region (245–320 µm)

min_sr_auto, max_sr_auto, range_sr_auto: Minimum, maximum, and range of ⁸⁷Sr/⁸⁶Sr value within a fixed region (160–320 µm)

X100, X110, X120,…, X320: Splined ⁸⁷Sr/⁸⁶Sr value at each distance from the otolith core (10 µm intervals)

mean_exog: Mean exogenous feeding check score

type: Data type (empirical or synthetic)

WY: Water year at collection

Date_sampled: Fish collection date

FL: Fish fork length in mm at collection

Site_name: Collection site name (Chipps Island)
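For readers unfamiliar with water years, the sketch below shows how a water year can be derived from a collection date under the common United States convention, in which a water year runs from October 1 to September 30 and is named for the calendar year in which it ends. Whether the WY column follows exactly this convention is an assumption, as is the ISO date format used here; the format of Date_sampled is not documented in this README. The dates in the example are invented.

```python
from datetime import date


def water_year(d):
    """Water year for date `d`, assuming an Oct 1 to Sep 30 year named for its end."""
    return d.year + 1 if d.month >= 10 else d.year


# Invented example dates (NOT taken from the dataset).
wy_spring = water_year(date.fromisoformat("2015-04-20"))
wy_autumn = water_year(date.fromisoformat("2015-10-01"))
print(wy_spring, wy_autumn)
```

Comparing values derived this way against the supplied WY column is a quick check on whether the assumed convention matches the dataset.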

Known-origin adult Chinook salmon otolith ⁸⁷Sr/⁸⁶Sr data (`adult_data_2025-07-14.csv`)

This dataset contains otolith ⁸⁷Sr/⁸⁶Sr measurement data for wild, known-origin adult Chinook salmon returns collected in the American, Stanislaus, and Yuba Rivers. Missing or unavailable values are indicated by NA.

Number of variables: 33
Number of cases/rows: 30
Variable list:

row_id: Unique row identifier (string)

Sample_ID: Sample identifier (string)

NAT_LOC: Natal source ID (categorical)

X100, X110, X120,…, X320: Splined ⁸⁷Sr/⁸⁶Sr value at each distance from the otolith core (10 µm intervals)

med_sr_rd1: Median ⁸⁷Sr/⁸⁶Sr value within natal region identified by reader 1

med_sr_auto: Median ⁸⁷Sr/⁸⁶Sr value within a fixed natal region (245–320 µm)

min_sr_auto, max_sr_auto, range_sr_auto: Minimum, maximum, and range of ⁸⁷Sr/⁸⁶Sr value within a fixed region (160–320 µm)

type: Data type (adult)

mean_exog: Mean exogenous feeding check score

Code/Software

All analyses were conducted in R. The scripts assume that the .Rmd files are placed in a scripts folder that sits alongside three other folders: a data folder containing the supplied data, a figures folder that starts empty and will receive the output figures, and an outputs folder that starts empty and will receive the outputs of the analyses.
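The folder layout described above can be prepared and checked with a short helper. The sketch below is only a convenience and is not part of the published scripts: it assumes the four folders are siblings under a single project root, creates any that are missing, and reports which of the supplied files have not yet been copied into place. The function name `prepare_project` is hypothetical.

```python
import tempfile
from pathlib import Path

# File names as listed in this README.
EXPECTED_SCRIPTS = [
    "01_cross_validation_2025-10-27.Rmd",
    "02_unknown_juvenile_assignment_2025-10-27.Rmd",
    "03_adult_assignment_2025-10-27.Rmd",
    "04_natal_comparison_2025-10-27.Rmd",
]
EXPECTED_DATA = [
    "natal_data_2025-07-14.csv",
    "chipps_juvenile_data_2025-07-14.csv",
    "adult_data_2025-07-14.csv",
]


def prepare_project(root):
    """Create the four folders under `root` and list files still missing."""
    root = Path(root)
    for folder in ("scripts", "data", "figures", "outputs"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    missing = [
        f"scripts/{name}"
        for name in EXPECTED_SCRIPTS
        if not (root / "scripts" / name).is_file()
    ]
    missing += [
        f"data/{name}"
        for name in EXPECTED_DATA
        if not (root / "data" / name).is_file()
    ]
    return missing


# Demonstration in a throwaway directory, so nothing is written to the
# current working directory.
with tempfile.TemporaryDirectory() as tmp:
    missing = prepare_project(tmp)
    created = sorted(p.name for p in Path(tmp).iterdir() if p.is_dir())

print("Folders created:", created)
print("Files still to copy:", missing)
```

In practice, call `prepare_project` on the real project directory, copy the downloaded files into scripts and data, and run it again until the returned list is empty.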
