Data from: Image feature embedding with a deep learning framework improves genome-wide association studies on dog endophenotypes

Published May 14, 2026 on Dryad. https://doi.org/10.5061/dryad.kkwh70sjq

Data files

May 14, 2026 version files (27.46 MB total)

features.zip (27.44 MB)
model_embedding-main.zip (11.84 KB)
README.md (4.90 KB)

Abstract

Domestic dogs exhibit remarkable morphological diversity, making quantitative characterization of their phenotypes challenging. Traditional phenotyping methods often rely on manual measurements, which are limited in their ability to capture complex visual traits. Deep learning provides a new opportunity to automatically extract informative and biologically meaningful features from images. In this study, we constructed a dataset of 13,254 dog images across multiple breeds and employed ResNet and ViT models to automatically extract 256-dimensional image embeddings. After dimensionality reduction using UMAP, we performed a GWAS on the extracted features and breed-level genotype data. We identified 15 genes previously reported to be associated with dog traits such as hair length and body size, as well as novel candidate genes related to body development and hair growth, including EIF2S2, TRHR, and TCF25, which harbor variants with potential functional relevance. This approach is validated by known genetic associations and can reveal new genotype-phenotype links. Building on these capabilities, this approach provides a scalable framework for phenotype extraction that enables population genetic studies in domestic dogs and can facilitate breeding in other economically important species.

Dataset DOI: 10.5061/dryad.kkwh70sjq

Description of the data and file structure

The dataset of domestic dog images was constructed to support image-based phenotyping and downstream genomic analyses. Raw images were collected with an automated script built on the Image-Downloader framework (https://github.com/HeroPPPPath/Image-Downloader-master), which queried Bing image search with standardized dog breed names. Two independent reviewers then curated the images manually to ensure breed consistency, exclusion of puppies, adequate image quality, minimal occlusion, and minimal human interference. Retained images were cropped where needed to remove human interference. The features.zip archive contains two HDF5 files holding the feature embeddings extracted with ResNet50 and ViT_b_16, respectively.

Files and variables

File: features.zip

Description: This archive contains deep learning-derived feature embeddings extracted from 13,254 dog images covering 181 breeds, for use in GWAS analyses.
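The internal layout of the two HDF5 files (dataset names, shapes, dtypes) is not listed here, so it can help to inspect a file before loading it. The following is a minimal sketch using h5py; the helper name `describe_hdf5` is hypothetical, and no dataset names are assumed.

```python
import h5py

def describe_hdf5(path):
    """Return {dataset_path: (shape, dtype)} for every dataset in the file."""
    summary = {}

    def visit(name, obj):
        # visititems walks groups and datasets; record datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary
```

Calling `describe_hdf5` on one of the unzipped files should reveal which dataset holds the embeddings and which, if any, holds the breed labels.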

File: model_embedding-main.zip

Description: This archive contains the Python scripts and Jupyter notebook used for feature extraction and UMAP post-processing.

Software & Scripts:

model_embedding.py: The main entry point for the feature extraction process. It handles command-line arguments, including input/output paths, model selection (ResNet/ViT), and specific layer targeting.

extract.py: Contains the core logic for data loading and feature extraction. It uses PyTorch’s ResNet and ViT backbones to process images and saves the resulting high-dimensional vectors in HDF5 format.

FeatureExtractor.py: A helper module defining the FeatureExtractor class. It enables the extraction of intermediate-layer representations and optionally projects them into a lower-dimensional embedding space.
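As an illustration of the general technique (not the code in FeatureExtractor.py), an intermediate layer can be captured with a PyTorch forward hook and then optionally projected to fewer dimensions. In this sketch the class name `HookedExtractor`, the toy backbone, and the untrained linear projection are all assumptions made for the example.

```python
import torch
from torch import nn

class HookedExtractor(nn.Module):
    """Capture one named layer's output and optionally project it."""

    def __init__(self, backbone, layer_name, in_dim, out_dim=None):
        super().__init__()
        self.backbone = backbone
        self._captured = None
        layer = dict(backbone.named_modules())[layer_name]
        layer.register_forward_hook(self._save_output)
        # Optional linear projection to a lower-dimensional embedding.
        self.project = nn.Linear(in_dim, out_dim) if out_dim else nn.Identity()

    def _save_output(self, module, inputs, output):
        self._captured = output

    def forward(self, x):
        self.backbone(x)  # the hook fills self._captured during this call
        feats = self._captured
        if feats.dim() == 4:  # (N, C, H, W) -> (N, C) by global average pooling
            feats = feats.mean(dim=(2, 3))
        return self.project(feats)

# Toy CNN standing in for ResNet50/ViT so the sketch runs without pretrained weights.
toy = nn.Sequential()
toy.add_module("conv", nn.Conv2d(3, 8, kernel_size=3, padding=1))
toy.add_module("relu", nn.ReLU())
toy.add_module("pool", nn.AdaptiveAvgPool2d(1))

extractor = HookedExtractor(toy, layer_name="relu", in_dim=8, out_dim=4).eval()
with torch.no_grad():
    embedding = extractor(torch.randn(2, 3, 16, 16))
```

Passing `out_dim=None` skips the projection and returns the pooled activations at the layer's native width.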

UMAP_embedding.ipynb: A Jupyter Notebook for post-processing the extracted features. It includes:

Loading features from HDF5 files.
Finding the optimal UMAP dimensionality based on variance ratios.
Performing UMAP embedding and outlier detection using the Z-score method.
Saving the final low-dimensional features and group means to CSV.

requirements.txt: Lists the necessary Python dependencies.

The scripts and notebook rely on the following packages:

h5py==3.8.0
torch==2.3.0
pillow==10.4.0
tqdm==4.66.5
torchvision==0.18.0
scikit-learn==1.3.2
umap-learn==0.5.6
pandas==1.2.4
numpy==1.24.3

All required Python packages are listed in requirements.txt. To install them, run:

pip install -r requirements.txt

Feature Extraction Code Usage

Navigate to the script directory
```
cd path/to/model_embedding-main
```

Run the script

The script supports the following command-line arguments (see model_embedding.py for details):

| Argument | Description | Default |
| --- | --- | --- |
| --input | Path to the directory containing input images (organized by class folders) | required |
| --output_dir | Directory to save extracted features | current script directory |
| --model | Model used to extract features | resnet50 |
| --model_path | Path to custom model weights (optional) | None |
| --layer_name | Name of the layer to extract features from | depends on model |
| --model_output_dimension | Dimension of the extracted feature vectors | 256 |
| --keep_original_dim | Whether to keep the original model output dimension | False |
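Because `--input` expects one sub-folder per class, a quick check of the directory before running extraction can catch empty or misplaced folders. This is a minimal sketch; the helper name `count_images_per_class` and the set of accepted file extensions are assumptions, not part of the released scripts.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the actual files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def count_images_per_class(root):
    """Return {class_folder_name: image_count} for a root/<class>/<image> layout."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        counts[class_dir.name] = sum(
            1
            for p in class_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Folders that report zero images are worth checking before a long extraction run.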

Example command

python model_embedding.py \
  --input /path/to/dog-breeds-181 \
  --output_dir /path/to/save/features \
  --model resnet50 \
  --layer_name layer4 \
  --model_output_dimension 256 \
  --keep_original_dim False
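The example passes `--keep_original_dim False` as a string. How model_embedding.py parses this flag is not shown here; in general, an argparse option declared with `type=bool` would treat any non-empty string, including "False", as True. The sketch below is a hypothetical parser that mirrors the argument table and uses an explicit converter (`str2bool`, an illustrative name) to avoid that pitfall.

```python
import argparse

def str2bool(value):
    """Convert common true/false spellings to a bool; reject anything else."""
    text = str(value).strip().lower()
    if text in {"true", "t", "yes", "y", "1"}:
        return True
    if text in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def build_parser():
    # Hypothetical parser mirroring the argument table; not the author's code.
    parser = argparse.ArgumentParser(description="Feature extraction (sketch)")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output_dir", default=".")
    parser.add_argument("--model", default="resnet50")
    parser.add_argument("--model_path", default=None)
    parser.add_argument("--layer_name", default=None)
    parser.add_argument("--model_output_dimension", type=int, default=256)
    parser.add_argument("--keep_original_dim", type=str2bool, default=False)
    return parser

args = build_parser().parse_args(
    ["--input", "imgs", "--keep_original_dim", "False"]
)
```

If the released script does not behave as expected with `False`, checking how it declares this argument is a reasonable first step.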

UMAP Embedding Usage

The extracted image features are saved in the HDF5 file feature.hdf5. To run your own UMAP dimensionality reduction on them, use the provided Jupyter Notebook UMAP_embedding.ipynb.
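The notebook's outlier-detection and group-mean steps can be approximated as follows. This is a sketch under stated assumptions: it runs on synthetic two-dimensional data rather than a real UMAP embedding, the threshold of 3 standard deviations is an assumed default, and the function names are hypothetical. The notebook's actual settings may differ.

```python
import numpy as np

def zscore_outlier_mask(embedding, threshold=3.0):
    """Flag rows whose |z-score| exceeds the threshold in any dimension."""
    mean = embedding.mean(axis=0)
    std = embedding.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    z = (embedding - mean) / std
    return (np.abs(z) > threshold).any(axis=1)

def group_means(embedding, labels):
    """Average the rows of `embedding` within each label."""
    labels = np.asarray(labels)
    return {
        label: embedding[labels == label].mean(axis=0)
        for label in np.unique(labels)
    }

# Synthetic stand-in for a low-dimensional embedding (not real data).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 2))
embedding[0] = [25.0, 25.0]  # one planted outlier
labels = np.array(["breed_a"] * 100 + ["breed_b"] * 100)

mask = zscore_outlier_mask(embedding)
means = group_means(embedding[~mask], labels[~mask])
```

The per-breed means in `means` correspond to the kind of group-level phenotype values that can be written to CSV and paired with breed-level genotype data.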

Access information

Other publicly accessible locations of the data:

https://www.kaggle.com/datasets/egoose/dog-breeds-181