Advancing provenance assignment using machine learning and time series analysis of chemical chronologies in archival tissues
Data files
Jan 28, 2026 version files: 327.36 KB
- 01_cross_validation_2025-10-27.Rmd (50.40 KB)
- 02_unknown_juvenile_assignment_2025-10-27.Rmd (17.13 KB)
- 03_adult_assignment_2025-10-27.Rmd (29.57 KB)
- 04_natal_comparison_2025-10-27.Rmd (20.99 KB)
- adult_data_2025-07-14.csv (16.61 KB)
- chipps_juvenile_data_2025-07-14.csv (32.24 KB)
- natal_data_2025-07-14.csv (154.32 KB)
- README.md (6.10 KB)
Abstract
Accurately assigning the provenance of organisms is critical for understanding ecological connectivity and guiding effective conservation and management. Natural chemical chronologies stored in metabolically inert, incrementally growing tissues (e.g., otoliths) provide a powerful tool for this purpose. However, traditional approaches face biological challenges (e.g., dispersal, maternal effects), collapse chronological data into summary metrics, and rely on subjective interpretation—limiting their accuracy and scalability. We present a novel, flexible framework that integrates machine learning, time-series analysis, and ensemble modeling to improve provenance assignment from archival tissue chemistry. Using otolith 87Sr/86Sr profiles from 17 natal sources of California Central Valley Chinook salmon (Oncorhynchus tshawytscha), we moved beyond conventional summary-based methods by developing fully automated time-series feature extraction, explicit time-series classification (including dynamic time warping [DTW] with k-nearest neighbors [KNN]), and ensemble models that combine multiple classifiers. We further incorporated simulated data to represent under-sampled life history strategies and validated the framework on real-world, known- and unknown-origin samples. Time-series-based approaches consistently outperformed traditional methods, particularly for sources with strong maternal signatures or early dispersal. Feature extraction approaches informed by biological knowledge were most effective when chemical chronologies followed predictable life-stage-specific patterns, whereas explicit time-series classification (DTW + KNN) excelled when sources displayed distinct overall profile “shapes.” Ensemble models leveraged the complementary strengths of individual approaches, outperforming any single method. 
Incorporating simulated data corrected systematic underrepresentation of key life history phenotypes in real-world applications, improving model performance, population composition estimates, and their relevance for management decisions. Our results highlight the power of treating archival chemical data as time series, combined with machine learning and ensemble strategies, to enhance the accuracy, consistency, and scalability of provenance assignment. This flexible framework is broadly applicable across taxa, tissues, and chemical markers, offering a practical roadmap for advancing ecological inference and informing conservation and management.
Dataset DOI: 10.5061/dryad.3bk3j9kz3
Description of the data and file structure
This README was generated on 2025-10-27 by K Arai.
Author (Name, Institution, Email):
K. Arai, University of California Davis, kharai@ucdavis.edu
This README describes the datasets and R code (.Rmd) needed to recreate the figures and analyses in the manuscript “Advancing Provenance Assignment using Machine Learning and Time Series Analysis of Chemical Chronologies in Archival Tissues”, submitted by K. Arai.
There are four major analyses within the manuscript. The first script, 01_cross_validation_2025-10-27.Rmd, compares the performance of different assignment models through cross-validation. The second script, 02_unknown_juvenile_assignment_2025-10-27.Rmd, assigns natal origin to real-world, wild unknown-origin juveniles (n = 50) collected at Chipps Island from 2014 to 2021. The third script, 03_adult_assignment_2025-10-27.Rmd, evaluates the benefits of including synthetic early migrant data in the training dataset using known-origin adult samples. The fourth script, 04_natal_comparison_2025-10-27.Rmd, evaluates the subjectivity involved in user-defined natal region assignments and how that subjectivity might influence subsequent natal assignment results.
Data list
- natal_data_2025-07-14.csv
- chipps_juvenile_data_2025-07-14.csv
- adult_data_2025-07-14.csv
Script list
- 01_cross_validation_2025-10-27.Rmd
- 02_unknown_juvenile_assignment_2025-10-27.Rmd
- 03_adult_assignment_2025-10-27.Rmd
- 04_natal_comparison_2025-10-27.Rmd
Files and variables
Known-origin juvenile Chinook salmon otolith 87Sr/86Sr reference data (natal_data_2025-07-14.csv)
This dataset contains otolith 87Sr/86Sr measurements from known-origin juvenile Chinook salmon, representing 17 distinct natal sources. It includes both empirical observations and synthetic (simulated) data for early migrants; the type variable distinguishes the two. Missing or unavailable values are indicated by NA.
- Number of variables: 40
- Number of cases/rows: 255
- Variable list:
row_id: Unique row identifier (string)
random_id: Unique row number (integer)
Sample_ID: Sample identifier (string)
NAT_LOC: Natal source ID (categorical)
HvW: Hatchery or wild origin (categorical)
natal_start_rd1, natal_start_rd2, natal_start_rd3: Natal region start distance (µm from otolith core) as identified by reader 1, 2, and 3, respectively
med_sr_rd1, med_sr_rd2, med_sr_rd3: Median 87Sr/86Sr value within natal region identified by reader 1, 2, and 3, respectively
med_sr_auto: Median 87Sr/86Sr value within a fixed natal region (245–320 µm)
min_sr_auto, max_sr_auto, range_sr_auto: Minimum, maximum, and range of 87Sr/86Sr value within a fixed region (160–320 µm)
X100, X110, X120,…, X320: Splined 87Sr/86Sr value at each distance from the otolith core (10 µm intervals)
mean_exog: Mean exogenous feeding check score
type: Data type (empirical or synthetic)
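The wide X100 to X320 columns can be reshaped into one row per sample and distance, which is often a convenient form for time-series work. The following is a minimal base R sketch, not the authors' code: the helper names `profile_to_long` and `window_summary` are hypothetical, the toy values are invented for illustration, and the file path in the comment is an assumption. The window summary recomputes minimum, maximum, and range over 160–320 µm from the splined columns only, so it may not reproduce the shipped min_sr_auto, max_sr_auto, and range_sr_auto values exactly.

```r
# Toy stand-in for natal_data_2025-07-14.csv (values are invented).
# With the real file one might instead use, e.g.:
#   natal <- read.csv("data/natal_data_2025-07-14.csv")  # path is an assumption
dist_um <- seq(100, 320, by = 10)        # 23 distances from the otolith core
vals <- rbind(
  seq(0.7040, 0.7062, length.out = 23),  # rising profile (invented)
  rep(0.7095, 23)                        # flat profile (invented)
)
colnames(vals) <- paste0("X", dist_um)   # "X100", "X110", ..., "X320"
toy <- data.frame(Sample_ID = c("toy_A", "toy_B"), vals,
                  stringsAsFactors = FALSE)

# Hypothetical helper: wide profile columns -> long (id, distance, value).
profile_to_long <- function(df, id_col = "Sample_ID") {
  cols <- grep("^X[0-9]+$", names(df), value = TRUE)
  data.frame(
    id = rep(df[[id_col]], times = length(cols)),
    distance_um = rep(as.numeric(sub("^X", "", cols)), each = nrow(df)),
    sr8786 = unlist(df[cols], use.names = FALSE),  # column-by-column order
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: min, max, and range of 87Sr/86Sr within a window.
window_summary <- function(long, from = 160, to = 320) {
  w <- long[long$distance_um >= from & long$distance_um <= to, ]
  mins <- tapply(w$sr8786, w$id, min, na.rm = TRUE)
  maxs <- tapply(w$sr8786, w$id, max, na.rm = TRUE)
  data.frame(id = names(mins),
             min_sr = as.numeric(mins),
             max_sr = as.numeric(maxs),
             range_sr = as.numeric(maxs - mins),
             stringsAsFactors = FALSE)
}

long <- profile_to_long(toy)
summ <- window_summary(long)
```

The long table keeps distance as a numeric column, so profiles can be plotted or filtered by distance without parsing column names again.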
Wild unknown-origin juvenile Chinook salmon otolith 87Sr/86Sr data (chipps_juvenile_data_2025-07-14.csv)
This dataset contains otolith 87Sr/86Sr measurements and metadata for wild, unknown-origin juvenile Chinook salmon collected at Chipps Island from 2014 to 2021. Missing or unavailable values are indicated by NA.
- Number of variables: 44
- Number of cases/rows: 50
- Variable list:
row_id: Unique row identifier (string)
random_id: Unique row number (integer)
Sample_ID: Sample identifier (string)
NAT_LOC: Natal source ID (categorical)
HvW: Hatchery or wild origin (categorical)
natal_start_rd1, natal_start_rd2, natal_start_rd3: Natal region start distance (µm from otolith core) as identified by reader 1, 2, and 3, respectively
med_sr_rd1, med_sr_rd2, med_sr_rd3: Median 87Sr/86Sr value within natal region identified by reader 1, 2, and 3, respectively
med_sr_auto: Median 87Sr/86Sr value within a fixed natal region (245–320 µm)
min_sr_auto, max_sr_auto, range_sr_auto: Minimum, maximum, and range of 87Sr/86Sr value within a fixed region (160–320 µm)
X100, X110, X120,…, X320: Splined 87Sr/86Sr value at each distance from the otolith core (10 µm intervals)
mean_exog: Mean exogenous feeding check score
type: Data type (empirical or synthetic)
WY: Water year at collection
Date_sampled: Fish collection date
FL: Fish fork length in mm at collection
Site_name: Collection site name (Chipps Island)
Known-origin adult Chinook salmon otolith 87Sr/86Sr data (adult_data_2025-07-14.csv)
This dataset contains otolith 87Sr/86Sr measurement data for wild, known-origin adult Chinook salmon returns collected in the American, Stanislaus, and Yuba Rivers. Missing or unavailable values are indicated by NA.
- Number of variables: 33
- Number of cases/rows: 30
- Variable list:
row_id: Unique row identifier (string)
Sample_ID: Sample identifier (string)
NAT_LOC: Natal source ID (categorical)
X100, X110, X120,…, X320: Splined 87Sr/86Sr value at each distance from the otolith core (10 µm intervals)
med_sr_rd1: Median 87Sr/86Sr value within natal region identified by reader 1
med_sr_auto: Median 87Sr/86Sr value within a fixed natal region (245–320 µm)
min_sr_auto, max_sr_auto, range_sr_auto: Minimum, maximum, and range of 87Sr/86Sr value within a fixed region (160–320 µm)
type: Data type (adult)
mean_exog: Mean exogenous feeding check score
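After loading a file, its dimensions can be compared against the row and variable counts listed above. This is a small sketch under stated assumptions: `expected_dims` simply transcribes the counts from this README, `check_dims` is a hypothetical helper, and the data frame used here is an empty placeholder rather than the real data. It also assumes the files are read so that no extra row-name column is added.

```r
# Row and variable counts as stated in this README.
expected_dims <- data.frame(
  file = c("natal_data_2025-07-14.csv",
           "chipps_juvenile_data_2025-07-14.csv",
           "adult_data_2025-07-14.csv"),
  n_rows = c(255L, 50L, 30L),
  n_cols = c(40L, 44L, 33L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if df has the dimensions listed for `file`.
check_dims <- function(df, file, expected = expected_dims) {
  target <- expected[expected$file == file, ]
  if (nrow(target) != 1L) stop("Unknown file: ", file)
  identical(dim(df), c(target$n_rows, target$n_cols))
}

# Placeholder with the adult file's stated shape (30 rows, 33 variables).
# With the real data one would pass the result of read.csv() instead.
placeholder <- as.data.frame(matrix(NA_real_, nrow = 30, ncol = 33))
ok_adult <- check_dims(placeholder, "adult_data_2025-07-14.csv")
ok_natal <- check_dims(placeholder, "natal_data_2025-07-14.csv")
```

A mismatch does not by itself say what went wrong, but it is a quick signal that a file was truncated, re-exported, or read with different options.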
Code/Software
All analyses were conducted in R. The scripts assume that the .Rmd files are placed in a scripts folder, alongside a data folder that contains the supplied data, a figures folder (initially empty) that will hold output figures, and an outputs folder (initially empty) that will hold the outputs of the analyses.
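One way to prepare this layout is sketched below. It assumes the four folders sit side by side under a single project root, which is an interpretation of the description above; the helper name `setup_project` and the root folder name are hypothetical, and a temporary directory is used so the example does not touch existing files.

```r
# Hypothetical helper: create the four folders under a project root.
# Assumes scripts/, data/, figures/, and outputs/ are siblings.
setup_project <- function(root) {
  dirs <- file.path(root, c("scripts", "data", "figures", "outputs"))
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Example using a temporary location; replace with a real project path.
root <- file.path(tempdir(), "provenance_project")
created <- setup_project(root)
all_present <- all(dir.exists(created))
```

The .Rmd files would then be copied into scripts and the three .csv files into data before knitting.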
