Data from: Image feature embedding with a deep learning framework improves genome-wide association studies on dog endophenotypes
Data files
May 14, 2026 version files 27.46 MB
- features.zip (27.44 MB)
- model_embedding-main.zip (11.84 KB)
- README.md (4.90 KB)
Abstract
Domestic dogs exhibit remarkable morphological diversity, making quantitative characterization of their phenotypes challenging. Traditional phenotyping methods often rely on manual measurements, which are limited in their ability to capture complex visual traits. Deep learning provides a new opportunity to automatically extract informative and biologically meaningful features from images. In this study, we constructed a dataset of 13,254 dog images across multiple breeds and employed ResNet and ViT models to automatically extract 256-dimensional image embeddings. After dimensionality reduction using UMAP, we performed a GWAS on the extracted features and breed-level genotype data. We identified 15 genes previously reported to be associated with dog traits such as hair length and body size, as well as novel candidate genes related to body development and hair growth, including EIF2S2, TRHR, and TCF25, which harbor variants with potential functional relevance. This approach is validated by known genetic associations and can reveal new genotype-phenotype links. Building on these capabilities, this approach provides a scalable framework for phenotype extraction that enables population genetic studies in domestic dogs and can facilitate breeding in other economically important species.
Dataset DOI: 10.5061/dryad.kkwh70sjq
Description of the data and file structure
The dataset of domestic dog images was constructed to support image-based phenotyping and downstream genomic analyses. Raw images were collected via an automated script using the Image-Downloader framework (https://github.com/HeroPPPPath/Image-Downloader-master) by querying Bing image search with standardized dog breed names. Two independent reviewers then manually curated the images to ensure breed consistency, exclusion of puppies, adequate image quality, minimal occlusion, and minimal human interference. Images labeled for retention were cropped where needed to remove human interference. The features.zip archive contains two HDF5 files holding the feature embeddings extracted with ResNet50 and ViT_b_16, respectively.
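The internal layout of the two HDF5 files (group and dataset names) is not described here, so a reasonable first step is to list what each file contains. The following is a minimal sketch using h5py; the helper name summarize_hdf5 and the file name in the comment are hypothetical and not part of the deposited code.

```python
import h5py


def summarize_hdf5(path):
    """Return {dataset_name: (shape, dtype)} for every dataset in the file."""
    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary


# Example call (hypothetical file name; use a file unpacked from features.zip):
# for name, (shape, dtype) in summarize_hdf5("resnet50_features.hdf5").items():
#     print(f"{name}: shape={shape}, dtype={dtype}")
```

Once the dataset names are known, an individual array can be loaded into memory with f[name][...] inside the same kind of with block.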
Files and variables
File: features.zip
Description: The dataset includes deep learning-derived feature embeddings extracted from 13,254 dog images covering 181 breeds for GWAS analyses.
File: model_embedding-main.zip
Description: The archive contains the Python scripts used for feature extraction and a Jupyter notebook for UMAP post-processing.
Software & Scripts:
model_embedding.py: The main entry point for the feature extraction process. It handles command-line arguments, including input/output paths, model selection (ResNet/ViT), and specific layer targeting.
extract.py: Contains the core logic for data loading and feature extraction. It uses PyTorch's ResNet and ViT backbones to process images and saves the resulting high-dimensional vectors in HDF5 format.
FeatureExtractor.py: A helper module defining the FeatureExtractor class. It enables the extraction of intermediate-layer representations and optionally projects them into a lower-dimensional embedding space.
UMAP_embedding.ipynb: A Jupyter Notebook for post-processing the extracted features. It includes:
- Loading features from HDF5 files.
- Finding the optimal UMAP dimensionality based on variance ratios.
- Performing UMAP embedding and outlier detection using the Z-score method.
- Saving the final low-dimensional features and group means to CSV.
requirements.txt: Lists the necessary Python dependencies.
The scripts and notebook rely on the following packages:
h5py==3.8.0
torch==2.3.0
pillow==10.4.0
tqdm==4.66.5
torchvision==0.18.0
scikit-learn==1.3.2
umap-learn==0.5.6
pandas==1.2.4
numpy==1.24.3
All required Python packages are listed in requirements.txt. To install them, run:
pip install -r requirements.txt
Feature Extraction Code Usage
- Navigate to the script directory
    cd path/to/model_embedding-main
- Run the script
  The script supports the following command-line arguments (see model_embedding.py for details):

| Argument | Description | Default |
| --- | --- | --- |
| --input | Path to the directory containing input images (organized by class folders) | required |
| --output_dir | Directory to save extracted features | current script directory |
| --model | Model to extract features | resnet50 |
| --model_path | Path to custom model weights (optional) | None |
| --layer_name | Name of the layer to extract features from | depends on model |
| --model_output_dimension | Dimension of the extracted feature vectors | 256 |
| --keep_original_dim | Whether to keep the original model output dimension | False |

- Example command
    python model_embedding.py \
        --input /path/to/dog-breeds-181 \
        --output_dir /path/to/save/features \
        --model resnet50 \
        --layer_name layer4 \
        --model_output_dimension 256 \
        --keep_original_dim False
UMAP Embedding Usage
The extracted image features are saved in the HDF5 file feature.hdf5. The Jupyter Notebook UMAP_embedding.ipynb is provided for running a customized UMAP dimensionality reduction on these features.
Access information
Other publicly accessible locations of the data:
