Encounter-conditioned photo-ID fusion (FinFriend code + data release)
Data files (Apr 21, 2026 version, 2.54 GB total)

- `all_data.csv` (5.56 MB)
- `core_id.csv` (1.47 KB)
- `pkl.tar.xz` (517.64 MB)
- `pkl.zip` (2.01 GB)
- `README.md` (12.08 KB)
- `splits.zip` (13.42 MB)
Abstract
FinFriend is an encounter-aware photo-identification fusion pipeline that refines per-image identity posteriors using two lightweight context terms learned from the training split: (i) global sighting priors and (ii) an encounter-conditioned co-occurrence (log-lift) context prior. The method is model-agnostic and operates as post-processing on classifier outputs. This dataset contains two main archive files: the precomputed splits (`splits.zip`) and the raw-logit outputs of each classifier trained in the accompanying manuscript (`pkl.tar.xz`; `pkl.zip` contains the same outputs as a zip archive). The raw outputs are distributed as a `.tar.xz` file for compression purposes. No-leakage invariant: co-occurrence artifacts (priors/log-lift) are computed from TRAIN only, and the test split comprises the newest encounters (chronological split). For full documentation, configuration details (Hydra), and optional training code, see the included README.
Dataset Release
- `splits.zip` contains all pre-generated content for each training split discussed in the associated manuscript.
- `pkl.tar.xz` and `pkl.zip` contain the raw logit-based output from each of the classifiers trained on the data splits described and detailed in `splits.zip`.
- `all_data.csv` contains an entry for each file, including its label, encounter, date, location, and photographer (all pseudo-anonymized while retaining encounter ordering and the image-encounter and animal-encounter distributions).
- `core_id.csv` contains the list of IDs that occur in each training split.
Encounter-conditioned fusion for photo-ID (taxon-agnostic; case study: Bigg’s killer whales)
FinFriend is an encounter-aware photo-identification fusion pipeline. It combines:
- a per-image ML classifier (trained in `learning/`), with
- encounter context (co-occurrence structure learned from the training split), to
- produce encounter-level identity predictions and evaluation reports.
The core idea is simple: if individuals co-occur non-randomly, then knowing “who tends to be seen with whom” can improve ambiguous photo-ID predictions when multiple images belong to the same encounter.
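To make "who tends to be seen with whom" concrete: the context term can be summarized as a log-lift score, i.e., how much more often two individuals appear together than independence would predict. A toy sketch over a hypothetical TRAIN encounter table (the released artifacts are computed in `preparation/`, not by this snippet):

```python
import math
from collections import Counter
from itertools import combinations

# Toy TRAIN encounters: each entry is the set of IDs sighted together.
encounters = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B"}, {"C"}]

n = len(encounters)
marginal = Counter(i for enc in encounters for i in enc)        # per-ID support
pair = Counter(frozenset(p) for enc in encounters
               for p in combinations(sorted(enc), 2))           # pair support

def loglift(i, j):
    """log[ P(i,j) / (P(i) P(j)) ]; > 0 means i and j co-occur more than chance."""
    p_ij = pair[frozenset((i, j))] / n
    return math.log(p_ij / ((marginal[i] / n) * (marginal[j] / n)))

print(round(loglift("A", "B"), 3))   # → 0.105 (mild positive association)
```

A positive score for an ID already believed present in the encounter then nudges ambiguous images toward its frequent companions.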
Review Demo (Reproduces manuscript results)
This release contains:
- `code/` — FinFriend source snapshot used for the paper. Hosted on Zenodo (MIT License).
- `splits/` — precomputed splits + TRAIN-only artifacts (priors, co-occurrence, log-lift).
- `pkl/` — per-image classifier outputs (class map + logits/probabilities).
Run
- In `configs/paths.yaml`, set `root` to the directory containing `code/`, `splits/`, and `pkl/`
- Install deps: `pip install -r requirements.txt`
- Run: `python -m inference.assisted_predictor`
What’s in this repo (and what isn’t)
Relevant
- `preparation/` — split construction + artifact generation (priors, co-occurrence, log-lift, diagnostics).
- `learning/` — training and evaluating the ML classifier.
- `inference/` — fusion, weight selection, evaluation, and run orchestration.
- `configs/` — Hydra configs and resolvers.
- `data/` — input/split helpers and common utilities.
- `results/` — aggregation, plotting hooks, and result writing (used by inference runs).
Pipeline overview (the steps you actually run)
The end-to-end flow is:
- Make split (and automatically generate priors + co-occurrence + loglift from TRAIN only)
- Train ML classifier (image → identity probabilities)
- Select fusion weights (tune how much context/priors should influence predictions)
- Run / evaluate (apply fusion on holdout/test, export metrics and artifacts)
A key invariant: co-occurrence artifacts are built from the TRAIN split only, and the test split is the newest `test_frac` fraction of encounters (chronological split), so there is no leakage from test into priors/co-occurrence/log-lift.
Installation
1) Create an environment
Use any Python environment manager you like. For example:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2) Sanity check imports
python -c "import preparation, learning, inference; print('ok')"
Data expectations
FinFriend expects a flat table (CSV / DataFrame) with at least:
- `Encounter` — encounter identifier (string).
- `Label` (or `Identifier`/`ID`/`Id`) — true identity label (string).
- `Filepath` — used for stats/diagnostics; strongly recommended.
Optional (but useful for reporting/diagnostics):
- `Date` — per-image date (or something parseable). If not present, the repo can infer it from the encounter string.
- `Photographer`, `Location` — used for descriptive stats.
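A quick sanity check on an input table, using the column names above (the toy rows and labels are illustrative):

```python
import pandas as pd

REQUIRED = ["Encounter", "Label", "Filepath"]
OPTIONAL = ["Date", "Photographer", "Location"]

df = pd.DataFrame({
    "Encounter": ["E001", "E001", "E002"],
    "Label": ["ID_01", "ID_02", "ID_01"],
    "Filepath": ["img1.jpg", "img2.jpg", "img3.jpg"],
})

missing = [c for c in REQUIRED if c not in df.columns]
assert not missing, f"missing required columns: {missing}"
print([c for c in OPTIONAL if c not in df.columns])  # columns to infer or skip
```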
Core ID list (core_ids.csv)
Splitting can optionally filter to a “core set” of IDs (e.g., individuals that have adequate representation). In make_split, the core_ids DataFrame is used to restrict identities across splits.
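Conceptually, that restriction is an `isin` filter on the label column (the shape of the `core_ids` frame below is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["A", "B", "C", "A"],
                   "Encounter": ["E1", "E1", "E2", "E3"]})
core_ids = pd.DataFrame({"Label": ["A", "B"]})   # hypothetical core-set contents

# Keep only rows whose identity belongs to the core set.
core = df[df["Label"].isin(core_ids["Label"])]
print(len(core))   # → 3
```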
Step 1 — Make a split (and build priors/co-occurrence/loglift)
This is the foundation of the pipeline: it creates train/val/holdout/test CSVs and automatically produces:
- `priors.csv` — global ID priors from TRAIN
- `cooccurrence.csv` — co-occurrence counts from TRAIN
- `loglift.csv` — log-lift association scores derived from TRAIN
- `encounter_id_list.csv` — encounter → "IDs present" table
- diagnostics JSON + plots (priors distributions, heatmaps, support scale)
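Conceptually, `priors.csv` is just the normalized identity frequency over TRAIN images; a minimal pandas sketch (toy labels, not the repo's exact code):

```python
import pandas as pd

# Toy TRAIN rows: one entry per image, labeled with its identity.
train = pd.DataFrame({"Label": ["A", "A", "B", "C", "A"]})

# Global sighting prior: fraction of TRAIN images per identity.
priors = train["Label"].value_counts(normalize=True).rename("prior")
print(priors["A"])   # → 0.6
```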
What “train/val/holdout/test” mean here
Splits are by encounter (not by image), and are contiguous in encounter order:
train | val | holdout | test (newest N%)
This preserves temporal realism: test is the newest encounters.
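Assuming encounters are already in chronological order, the contiguous split above can be sketched as follows (fraction values are illustrative; holdout takes the remainder):

```python
def chronological_split(encounters, train_frac=0.5, val_frac=0.2, test_frac=0.1):
    """Contiguous blocks over chronologically ordered encounters; holdout is the remainder."""
    n = len(encounters)
    a = int(n * train_frac)
    b = a + int(n * val_frac)
    c = n - int(n * test_frac)          # test is the newest test_frac of encounters
    return {"train": encounters[:a], "val": encounters[a:b],
            "holdout": encounters[b:c], "test": encounters[c:]}

parts = chronological_split(list(range(10)))   # 10 encounters, oldest → newest
print(parts["test"])   # → [9], the newest encounter block
```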
Example: run make_split for multiple training fractions
The `__main__` block in the split script already demonstrates this pattern:

- fixed `test_frac = 0.10`
- vary `train_frac` over `{0.1, 0.3, 0.5, 0.7}`
- output to `.../splits/train-10`, `train-30`, etc.
A typical invocation pattern:
python preparation/make_split.py
If you prefer calling it as a module (depends on your packaging/import paths):
python -m preparation.make_split
Outputs written to out_dir
For each split directory (e.g., splits/train-30/) you should expect:
Split CSVs

- `train.csv`, `val.csv`, `holdout.csv`, `test.csv`
- `encounter_data.csv` (concatenated split rows, with columns normalized for downstream inference)

Artifacts (built from TRAIN only)

- `priors.csv`
- `cooccurrence.csv`
- `loglift.csv`
- `p1.csv`, `logp2.csv` (intermediate probability terms used in the log-lift computation)
- `encounter_id_list.csv`
Diagnostics

- `split_statistics.json`
- `split_sequential_ok.json`
- `support_diagnostics.json`
- `priors_coocc_loglift_stats.json`

Plots

- `plots/priors_hist.png`, `plots/priors_log_hist.png`, `plots/priors_topK.png`
- `plots/coocc_topK_log1p.png`, `plots/loglift_topK_clip.png`
- `plots/support_*.png` (support histograms and relationships)
Step 2 — Train the ML classifier (learning/)
Training code is included for completeness; not required to reproduce paper results.
The code in `learning/` contains the modules and configuration files needed to re-create the EfficientNet-B0-based deep learning model used here.
Step 3 — Weight selection
This step chooses the fusion hyperparameters (weights) that determine how much the final prediction should trust:
- the ML classifier probabilities,
- global priors,
- encounter context (loglift / co-occurrence support),
- and any initialization strategy for “who might be in the encounter”.
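One way to picture the fusion is a weighted log-linear combination of those terms. This is a hedged sketch, not the repo's exact parameterization (the weight names `w_ml`/`w_prior`/`w_ctx` are invented here; the real ones live in `configs/fusion.yaml`):

```python
import numpy as np

def fuse(log_ml, log_prior, loglift_ctx, w_ml=1.0, w_prior=0.3, w_ctx=0.5):
    """Weighted log-linear fusion of per-image scores; returns a normalized posterior."""
    score = w_ml * log_ml + w_prior * log_prior + w_ctx * loglift_ctx
    score -= score.max()                      # subtract max for numerical stability
    p = np.exp(score)
    return p / p.sum()

log_ml = np.log(np.array([0.5, 0.3, 0.2]))     # classifier probabilities
log_prior = np.log(np.array([0.2, 0.5, 0.3]))  # global sighting priors
ctx = np.array([0.8, -0.2, 0.0])               # encounter log-lift support
print(fuse(log_ml, log_prior, ctx).round(3))
```

With these toy numbers the positive context term reinforces the classifier's top candidate; weight selection is about tuning that balance.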
In your tree, weight selection logic is supported by:
- `inference/assisted_predictor.py` (recommended rename: `weight_search.py` or `select_weights.py`)
- `inference/weight_selection.py`
- `inference/set_initialization.py`
- `inference/fusion.py`
- `inference/metrics.py`, `inference/evaluation.py`
What it does
At a high level, weight selection:
- Loads predictions from the trained ML model on some evaluation split (often holdout).
- Runs fusion across a grid or set of candidate hyperparameters.
- Scores each configuration (macro-F1, accuracy, logloss, Brier, etc.).
- Writes out the best configuration(s) and a summary table.
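At that level of abstraction, the sweep is a plain grid search; the scorer below is a toy stand-in for the real macro-F1/logloss evaluation on holdout, and the grid values are illustrative:

```python
from itertools import product

def evaluate(w_prior, w_ctx):
    """Stand-in scorer; the real pipeline scores fused holdout predictions."""
    return -((w_prior - 0.3) ** 2 + (w_ctx - 0.5) ** 2)   # toy objective, peak at (0.3, 0.5)

grid = product([0.0, 0.3, 0.6], [0.0, 0.5, 1.0])
best = max(grid, key=lambda w: evaluate(*w))
print(best)   # → (0.3, 0.5)
```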
Typical run pattern
You’ll generally provide:
- the split directory containing `loglift.csv`, `priors.csv`, and `encounter_data.csv`
- the model predictions file(s) to fuse
- the strategy family to evaluate (e.g., random-known, top-unknown, null-set, etc.)
Example shape of a run command:
python inference/assisted_predictor.py \
+paths.split_dir=/path/to/splits/train-30 \
+paths.predictions_dir=/path/to/model/preds \
+fusion.grid=default
Again, exact key names depend on your Hydra config; the repo’s intent is that these are controlled via:
- `configs/fusion.yaml`
- `configs/results.yaml`
- `configs/paths.yaml`
- `configs/config.yaml`
Output
Expect:
- a “best weights” config (or JSON/YAML artifact)
- a results table over the grid
- logs and optionally plots for the selection sweep
Step 4 — Run inference / evaluate (inference/runner.py)
Once you have:
- a split directory (with priors/loglift artifacts)
- a trained ML classifier (or its predictions)
- selected fusion weights
…you can run the full pipeline on holdout and/or test.
Where to look:
- `inference/runner.py` — orchestration entrypoint
- `inference/encounter_inference.py` — encounter-level inference logic
- `inference/fusion.py` — fusion implementation
- `inference/evaluation.py` — metrics and evaluation routines
- `inference/persistence.py` — saving outputs
- `results/post_process.py`, `results/artifact_writer.py` — result formatting and export
Typical run pattern
Example command shape:
python inference/runner.py \
+paths.split_dir=/path/to/splits/train-30 \
+paths.predictions_dir=/path/to/model/preds \
+paths.weights=/path/to/selected_weights.yaml \
+eval.split=test
Expected output
A run typically writes:
- per-encounter predictions
- per-image and aggregate metrics
- summary tables for reporting
- artifacts suitable for manuscript figures/tables (via `results/` helpers)
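Encounter-level predictions reduce, conceptually, to pooling per-image posteriors within an encounter; geometric-mean pooling is one common choice (a sketch, not necessarily the repo's exact rule):

```python
import numpy as np

# Per-image posteriors for three images in one encounter (columns = candidate IDs).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.7, 0.1]])

log_pool = np.log(posteriors).mean(axis=0)     # geometric-mean pooling in log space
p = np.exp(log_pool - log_pool.max())
p /= p.sum()                                   # renormalize to a posterior
print(p.argmax())   # → 0, the encounter-level top-1 identity index
```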
Configuration guide (Hydra)
Configs live in configs/:
- `config.yaml` — top-level composition
- `paths.yaml` — where data/splits/predictions/results live
- `models.yaml` — model selection
- `fusion.yaml` — fusion hyperparameters and grids
- `results.yaml` — output formatting and artifact writing
- `anonymize.yaml` — dataset anonymization settings (if used)
Resolvers:
- `configs/resolvers.py` — custom Hydra resolvers (path building, derived fields, etc.)
Useful Hydra patterns
Show composed config:
python inference/runner.py --cfg job
Notes on the split implementation (important details)
Your make_split(...) function:
- enforces encounter-disjoint splits
- writes a strong set of descriptive statistics (counts, missingness, concentration, etc.)
- builds priors/co-occurrence/loglift only from TRAIN
- writes “sequential OK” checks to confirm the split blocks are ordered
- creates `encounter_data.csv` with column names normalized for downstream inference
If Date, Location, or Photographer are missing, the script can populate them from the encounter string using:
- `_add_date`
- `_add_location`
- `_add_photographer`
This is convenient for internally consistent reporting, but if you have canonical metadata, prefer using that instead.
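As an illustration only: a hypothetical `_add_date`-style helper, assuming the encounter string begins with a `YYYY-MM-DD` stamp (the real parsing depends on your naming scheme):

```python
import re
from datetime import date

def date_from_encounter(enc: str):
    """Hypothetical helper: pull a leading YYYY-MM-DD date out of an encounter ID."""
    m = re.match(r"(\d{4})-(\d{2})-(\d{2})", enc)
    return date(*map(int, m.groups())) if m else None

print(date_from_encounter("2024-06-15_encounter01"))   # → 2024-06-15
```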
Suggested cleanup / rename (optional but recommended)
To make the repo easier for others to follow:
- Rename `inference/assisted_predictor.py` → `inference/weight_search.py` (or `select_weights.py`); "assisted predictor" reads like a different concept than "hyperparameter selection for fusion".
- Keep `graphing.py` and Louvain artifacts out of the main tree (or move them to an `experiments/` folder).
Quickstart summary (copy/paste checklist)
- Prepare dataset
- Ensure the CSV has `Encounter`, `Label`, `Filepath` (+ optional `Date`, `Location`, `Photographer`)
- Make splits + artifacts
python preparation/make_split.py
# outputs: splits/train-XX/ with train/val/holdout/test + priors/loglift/etc.
- Train classifier
python learning/main.py
- Select fusion weights
python inference/assisted_predictor.py
- Run evaluation
python inference/runner.py
Repository structure (high level)
configs/ Hydra configs and resolvers
preparation/ Splits + priors/coocc/loglift generation
learning/ ML training + evaluation
inference/ Fusion, weight selection, runner, metrics
data/ Data helpers and utilities
results/ Result writing, aggregation, plotting hooks
testing/ Minimal tests
License
No separate license applies to the data files on Dryad. For the code on Zenodo, see the LICENSE file (MIT).
