Encounter-conditioned photo-ID fusion (FinFriend code + data release)
Data files (Apr 21, 2026 version, 2.54 GB total)

- `all_data.csv` (5.56 MB)
- `core_id.csv` (1.47 KB)
- `pkl.tar.xz` (517.64 MB)
- `pkl.zip` (2.01 GB)
- `README.md` (12.08 KB)
- `splits.zip` (13.42 MB)
Abstract
FinFriend is an encounter-aware photo-identification fusion pipeline that refines per-image identity posteriors using two lightweight context terms learned from the training split: (i) global sighting priors and (ii) an encounter-conditioned co-occurrence (log-lift) context prior. The method is model-agnostic and operates as post-processing on classifier outputs. This dataset contains two main archive files: the precomputed splits (`splits.zip`) and the raw-logit outputs of each classifier trained in the accompanying manuscript (`pkl.tar.xz`; `pkl.zip` contains the same outputs as a zip archive). The raw outputs are distributed as a `.tar.xz` file for compression purposes. No-leakage invariant: co-occurrence artifacts (priors/log-lift) are computed from TRAIN only, and the test split comprises the newest encounters (chronological split). For full documentation, configuration details (Hydra), and optional training code, see the included README.
Dataset Release
- `splits.zip` contains all pre-generated content for each training split discussed in the associated manuscript.
- `pkl.tar.xz` and `pkl.zip` contain the raw logit-based output from each of the classifiers trained on the data splits described and detailed in `splits.zip`.
- `all_data.csv` contains an entry for each file, including its label, encounter, date, location, and photographer (all pseudo-anonymized while retaining encounter ordering and the image-encounter and animal-encounter distributions).
- `core_id.csv` contains the list of IDs that occur in each training split.
Encounter-conditioned fusion for photo-ID (taxon-agnostic; case study: Bigg’s killer whales)
FinFriend is an encounter-aware photo-identification fusion pipeline. It combines:
- a per-image ML classifier (trained in `learning/`), with
- encounter context (co-occurrence structure learned from the training split), to
- produce encounter-level identity predictions and evaluation reports.
The core idea is simple: if individuals co-occur non-randomly, then knowing “who tends to be seen with whom” can improve ambiguous photo-ID predictions when multiple images belong to the same encounter.
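To make "who tends to be seen with whom" concrete: the context term can be summarized as a log-lift score, i.e., how much more often two individuals appear together than independence would predict. A toy sketch over a hypothetical TRAIN encounter table (the released artifacts are computed in `preparation/`, not by this snippet):

```python
import math
from collections import Counter
from itertools import combinations

# Toy TRAIN encounters: each entry is the set of IDs sighted together.
encounters = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B"}, {"C"}]

n = len(encounters)
marginal = Counter(i for enc in encounters for i in enc)        # per-ID support
pair = Counter(frozenset(p) for enc in encounters
               for p in combinations(sorted(enc), 2))           # pair support

def loglift(i, j):
    """log[ P(i,j) / (P(i) P(j)) ]; > 0 means i and j co-occur more than chance."""
    p_ij = pair[frozenset((i, j))] / n
    return math.log(p_ij / ((marginal[i] / n) * (marginal[j] / n)))

print(round(loglift("A", "B"), 3))   # → 0.105 (mild positive association)
```

A positive score for an ID already believed present in the encounter then nudges ambiguous images toward its frequent companions.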
Review Demo (Reproduces manuscript results)
This release contains:
- `code/` — FinFriend source snapshot used for the paper. Hosted on Zenodo (MIT License).
- `splits/` — precomputed splits + TRAIN-only artifacts (priors, co-occurrence, log-lift).
- `pkl/` — per-image classifier outputs (class map + logits/probabilities).
Run
- In `configs/paths.yaml`, set `root` to the directory containing `code/`, `splits/`, and `pkl/`
- Install deps: `pip install -r requirements.txt`
- Run: `python -m inference.assisted_predictor`
What’s in this repo (and what isn’t)
Relevant
- `preparation/` — split construction + artifact generation (priors, co-occurrence, log-lift, diagnostics).
- `learning/` — training and evaluating the ML classifier.
- `inference/` — fusion, weight selection, evaluation, and run orchestration.
- `configs/` — Hydra configs and resolvers.
- `data/` — input/split helpers and common utilities.
- `results/` — aggregation, plotting hooks, and result writing (used by inference runs).
Pipeline overview (the steps you actually run)
The end-to-end flow is:
- Make split (and automatically generate priors + co-occurrence + loglift from TRAIN only)
- Train ML classifier (image → identity probabilities)
- Select fusion weights (tune how much context/priors should influence predictions)
- Run / evaluate (apply fusion on holdout/test, export metrics and artifacts)
A key invariant: co-occurrence artifacts are built from the TRAIN split only, and the test split is the newest `test_frac` fraction of encounters (chronological split), so there is no leakage from test into priors/co-occurrence/log-lift.
Installation
1) Create an environment
Use any Python environment manager you like. For example:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
2) Sanity check imports
python -c "import preparation, learning, inference; print('ok')"
Data expectations
FinFriend expects a flat table (CSV / DataFrame) with at least:
- `Encounter` — encounter identifier (string).
- `Label` (or `Identifier`/`ID`/`Id`) — true identity label (string).
- `Filepath` — used for stats/diagnostics; strongly recommended.
Optional (but useful for reporting/diagnostics):
- `Date` — per-image date (or something parseable). If not present, the repo can infer it from the encounter string.
- `Photographer`, `Location` — used for descriptive stats.
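A quick sanity check on an input table, using the column names above (the toy rows and labels are illustrative):

```python
import pandas as pd

REQUIRED = ["Encounter", "Label", "Filepath"]
OPTIONAL = ["Date", "Photographer", "Location"]

df = pd.DataFrame({
    "Encounter": ["E001", "E001", "E002"],
    "Label": ["ID_01", "ID_02", "ID_01"],
    "Filepath": ["img1.jpg", "img2.jpg", "img3.jpg"],
})

missing = [c for c in REQUIRED if c not in df.columns]
assert not missing, f"missing required columns: {missing}"
print([c for c in OPTIONAL if c not in df.columns])  # columns to infer or skip
```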
Core ID list (core_ids.csv)
Splitting can optionally filter to a “core set” of IDs (e.g., individuals that have adequate representation). In make_split, the core_ids DataFrame is used to restrict identities across splits.
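Conceptually, that restriction is an `isin` filter on the label column (the shape of the `core_ids` frame below is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"Label": ["A", "B", "C", "A"],
                   "Encounter": ["E1", "E1", "E2", "E3"]})
core_ids = pd.DataFrame({"Label": ["A", "B"]})   # hypothetical core-set contents

# Keep only rows whose identity belongs to the core set.
core = df[df["Label"].isin(core_ids["Label"])]
print(len(core))   # → 3
```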
Step 1 — Make a split (and build priors/co-occurrence/loglift)
This is the foundation of the pipeline: it creates train/val/holdout/test CSVs and automatically produces:
- `priors.csv` — global ID priors from TRAIN
- `cooccurrence.csv` — co-occurrence counts from TRAIN
- `loglift.csv` — log-lift association scores derived from TRAIN
- `encounter_id_list.csv` — encounter → "IDs present" table
- diagnostics JSON + plots (priors distributions, heatmaps, support scale)
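Conceptually, `priors.csv` is just the normalized identity frequency over TRAIN images; a minimal pandas sketch (toy labels, not the repo's exact code):

```python
import pandas as pd

# Toy TRAIN rows: one entry per image, labeled with its identity.
train = pd.DataFrame({"Label": ["A", "A", "B", "C", "A"]})

# Global sighting prior: fraction of TRAIN images per identity.
priors = train["Label"].value_counts(normalize=True).rename("prior")
print(priors["A"])   # → 0.6
```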
What “train/val/holdout/test” mean here
Splits are by encounter (not by image), and are contiguous in encounter order:
train | val | holdout | test (newest N%)
This preserves temporal realism: test is the newest encounters.
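Assuming encounters are already in chronological order, the contiguous split above can be sketched as follows (fraction values are illustrative; holdout takes the remainder):

```python
def chronological_split(encounters, train_frac=0.5, val_frac=0.2, test_frac=0.1):
    """Contiguous blocks over chronologically ordered encounters; holdout is the remainder."""
    n = len(encounters)
    a = int(n * train_frac)
    b = a + int(n * val_frac)
    c = n - int(n * test_frac)          # test is the newest test_frac of encounters
    return {"train": encounters[:a], "val": encounters[a:b],
            "holdout": encounters[b:c], "test": encounters[c:]}

parts = chronological_split(list(range(10)))   # 10 encounters, oldest → newest
print(parts["test"])   # → [9], the newest encounter block
```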
Example: run make_split for multiple training fractions
The `__main__` block in the split script already demonstrates this pattern:

- fixed `test_frac = 0.10`
- vary `train_frac` over `{0.1, 0.3, 0.5, 0.7}`
- output to `.../splits/train-10`, `train-30`, etc.
A typical invocation pattern:
python preparation/make_split.py
If you prefer calling it as a module (depends on your packaging/import paths):
python -m preparation.make_split
Outputs written to out_dir
For each split directory (e.g., splits/train-30/) you should expect:
Split CSVs

- `train.csv`, `val.csv`, `holdout.csv`, `test.csv`
- `encounter_data.csv` (concatenated split rows, with columns normalized for downstream inference)

Artifacts (built from TRAIN only)

- `priors.csv`
- `cooccurrence.csv`
- `loglift.csv`
- `p1.csv`, `logp2.csv` (intermediate probability terms used in the log-lift computation)
- `encounter_id_list.csv`
Diagnostics

- `split_statistics.json`
- `split_sequential_ok.json`
- `support_diagnostics.json`
- `priors_coocc_loglift_stats.json`

Plots

- `plots/priors_hist.png`, `plots/priors_log_hist.png`, `plots/priors_topK.png`
- `plots/coocc_topK_log1p.png`, `plots/loglift_topK_clip.png`
- `plots/support_*.png` (support histograms and relationships)
Step 2 — Train the ML classifier (learning/)
Training code is included for completeness; not required to reproduce paper results.
The code in `learning/` contains the modules and configuration files needed to re-create the EfficientNet-B0-based deep learning model used here.
Step 3 — Weight selection
This step chooses the fusion hyperparameters (weights) that determine how much the final prediction should trust:
- the ML classifier probabilities,
- global priors,
- encounter context (loglift / co-occurrence support),
- and any initialization strategy for “who might be in the encounter”.
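One way to picture the fusion is a weighted log-linear combination of those terms. This is a hedged sketch, not the repo's exact parameterization (the weight names `w_ml`/`w_prior`/`w_ctx` are invented here; the real ones live in `configs/fusion.yaml`):

```python
import numpy as np

def fuse(log_ml, log_prior, loglift_ctx, w_ml=1.0, w_prior=0.3, w_ctx=0.5):
    """Weighted log-linear fusion of per-image scores; returns a normalized posterior."""
    score = w_ml * log_ml + w_prior * log_prior + w_ctx * loglift_ctx
    score -= score.max()                      # subtract max for numerical stability
    p = np.exp(score)
    return p / p.sum()

log_ml = np.log(np.array([0.5, 0.3, 0.2]))     # classifier probabilities
log_prior = np.log(np.array([0.2, 0.5, 0.3]))  # global sighting priors
ctx = np.array([0.8, -0.2, 0.0])               # encounter log-lift support
print(fuse(log_ml, log_prior, ctx).round(3))
```

With these toy numbers the positive context term reinforces the classifier's top candidate; weight selection is about tuning that balance.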
In your tree, weight selection logic is supported by:
- `inference/assisted_predictor.py` (recommended rename: `weight_search.py` or `select_weights.py`)
- `inference/weight_selection.py`
- `inference/set_initialization.py`
- `inference/fusion.py`
- `inference/metrics.py`, `inference/evaluation.py`
What it does
At a high level, weight selection:
- Loads predictions from the trained ML model on some evaluation split (often holdout).
- Runs fusion across a grid or set of candidate hyperparameters.
- Scores each configuration (macro-F1, accuracy, logloss, Brier, etc.).
- Writes out the best configuration(s) and a summary table.
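At that level of abstraction, the sweep is a plain grid search; the scorer below is a toy stand-in for the real macro-F1/logloss evaluation on holdout, and the grid values are illustrative:

```python
from itertools import product

def evaluate(w_prior, w_ctx):
    """Stand-in scorer; the real pipeline scores fused holdout predictions."""
    return -((w_prior - 0.3) ** 2 + (w_ctx - 0.5) ** 2)   # toy objective, peak at (0.3, 0.5)

grid = product([0.0, 0.3, 0.6], [0.0, 0.5, 1.0])
best = max(grid, key=lambda w: evaluate(*w))
print(best)   # → (0.3, 0.5)
```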
Typical run pattern
You’ll generally provide:
- the split directory containing `loglift.csv`, `priors.csv`, and `encounter_data.csv`
- the model predictions file(s) to fuse
- the strategy family to evaluate (e.g., random-known, top-unknown, null-set, etc.)
Example shape of a run command:
python inference/assisted_predictor.py \
+paths.split_dir=/path/to/splits/train-30 \
+paths.predictions_dir=/path/to/model/preds \
+fusion.grid=default
Again, exact key names depend on your Hydra config; the repo’s intent is that these are controlled via:
- `configs/fusion.yaml`
- `configs/results.yaml`
- `configs/paths.yaml`
- `configs/config.yaml`
Output
Expect:
- a “best weights” config (or JSON/YAML artifact)
- a results table over the grid
- logs and optionally plots for the selection sweep
Step 4 — Run inference / evaluate (inference/runner.py)
Once you have:
- a split directory (with priors/loglift artifacts)
- a trained ML classifier (or its predictions)
- selected fusion weights
…you can run the full pipeline on holdout and/or test.
Where to look:
- `inference/runner.py` — orchestration entrypoint
- `inference/encounter_inference.py` — encounter-level inference logic
- `inference/fusion.py` — fusion implementation
- `inference/evaluation.py` — metrics and evaluation routines
- `inference/persistence.py` — saving outputs
- `results/post_process.py`, `results/artifact_writer.py` — result formatting and export
Typical run pattern
Example command shape:
python inference/runner.py \
+paths.split_dir=/path/to/splits/train-30 \
+paths.predictions_dir=/path/to/model/preds \
+paths.weights=/path/to/selected_weights.yaml \
+eval.split=test
Expected output
A run typically writes:
- per-encounter predictions
- per-image and aggregate metrics
- summary tables for reporting
- artifacts suitable for manuscript figures/tables (via `results/` helpers)
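Encounter-level predictions reduce, conceptually, to pooling per-image posteriors within an encounter; geometric-mean pooling is one common choice (a sketch, not necessarily the repo's exact rule):

```python
import numpy as np

# Per-image posteriors for three images in one encounter (columns = candidate IDs).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.7, 0.1]])

log_pool = np.log(posteriors).mean(axis=0)     # geometric-mean pooling in log space
p = np.exp(log_pool - log_pool.max())
p /= p.sum()                                   # renormalize to a posterior
print(p.argmax())   # → 0, the encounter-level top-1 identity index
```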
Configuration guide (Hydra)
Configs live in configs/:
- `config.yaml` — top-level composition
- `paths.yaml` — where data/splits/predictions/results live
- `models.yaml` — model selection
- `fusion.yaml` — fusion hyperparameters and grids
- `results.yaml` — output formatting and artifact writing
- `anonymize.yaml` — dataset anonymization settings (if used)
Resolvers:
- `configs/resolvers.py` — custom Hydra resolvers (path building, derived fields, etc.)
Useful Hydra patterns
Show composed config:
python inference/runner.py --cfg job
Notes on the split implementation (important details)
Your make_split(...) function:
- enforces encounter-disjoint splits
- writes a strong set of descriptive statistics (counts, missingness, concentration, etc.)
- builds priors/co-occurrence/loglift only from TRAIN
- writes “sequential OK” checks to confirm the split blocks are ordered
- creates `encounter_data.csv` with column names normalized for downstream inference
If Date, Location, or Photographer are missing, the script can populate them from the encounter string using:
- `_add_date`
- `_add_location`
- `_add_photographer`
This is convenient for internally consistent reporting, but if you have canonical metadata, prefer using that instead.
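As an illustration only: a hypothetical `_add_date`-style helper, assuming the encounter string begins with a `YYYY-MM-DD` stamp (the real parsing depends on your naming scheme):

```python
import re
from datetime import date

def date_from_encounter(enc: str):
    """Hypothetical helper: pull a leading YYYY-MM-DD date out of an encounter ID."""
    m = re.match(r"(\d{4})-(\d{2})-(\d{2})", enc)
    return date(*map(int, m.groups())) if m else None

print(date_from_encounter("2024-06-15_encounter01"))   # → 2024-06-15
```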
Suggested cleanup / rename (optional but recommended)
To make the repo easier for others to follow:
- Rename `inference/assisted_predictor.py` → `inference/weight_search.py` (or `select_weights.py`); "assisted predictor" reads like a different concept than "hyperparameter selection for fusion".
- Keep `graphing.py` and Louvain artifacts out of the main tree (or move them to an `experiments/` folder).
Quickstart summary (copy/paste checklist)
- Prepare dataset
- Ensure the CSV has `Encounter`, `Label`, `Filepath` (+ optional `Date`, `Location`, `Photographer`)
- Make splits + artifacts
python preparation/make_split.py
# outputs: splits/train-XX/ with train/val/holdout/test + priors/loglift/etc.
- Train classifier
python learning/main.py
- Select fusion weights
python inference/assisted_predictor.py
- Run evaluation
python inference/runner.py
Repository structure (high level)
configs/ Hydra configs and resolvers
preparation/ Splits + priors/coocc/loglift generation
learning/ ML training + evaluation
inference/ Fusion, weight selection, runner, metrics
data/ Data helpers and utilities
results/ Result writing, aggregation, plotting hooks
testing/ Minimal tests
License
No separate license applies to the data files on Dryad. For the code on Zenodo, see the LICENSE file (MIT).
