Mullis, Martin N.1; Lefebvre, Austin Epiphane Yann Tung-Shan1; Sivasubramanian, Kathyayini1; Luo, Angela1; Schmid, Florian1; Sooknah, Matt1; Wright, Kevin M.1; Raj, Anil1; Zavala-Solorio, José1; Zhang, Chunlian1; Riegler, Johannes1; Gillich, Astrid1; Ruby, J. Graham1

Published Jan 29, 2026; Updated Feb 06, 2026 on Dryad. https://doi.org/10.5061/dryad.sj3tx96gd

Diversity Outbred (DO) mice are a powerful model system for mapping complex traits due to their high genetic diversity and mapping resolution. However, while there are extensive tools available for standard genetic analysis in DO mice, fewer techniques have been implemented to facilitate integrated, cross-study analysis. Here, we implement Haseman-Elston regression to estimate genetic correlations among 7,233 phenotypes measured across eleven independent DO mouse studies. We used this network of genetic correlations to cluster phenotypes according to shared genetics, which enhanced the power to detect quantitative trait loci (QTL). This approach empowered the detection of 884 QTL for 383 meta-phenotypes, explaining an average of 40.36% of the total genetic variance per mega-analysis. We leveraged this network for insights into specific areas of biology, including lifespan, frailty, immune composition, histological and functional lung phenotypes, and histological phenotypes of the aorta. We found the genetics of lifespan to share limited correlation with the genetics of frailty but stronger correlation with the genetics of immune cell composition. Additionally, mega-analyses driven by genetic correlations identified candidate genes (e.g., Cdkn2b) associated with degraded extracellular matrix in the aorta. Finally, an ensemble of genetic analyses implicated pulmonary neuroendocrine cell signaling and/or differentiation as a key driver of multiple lung pathophenotypes.

Dataset DOI: 10.5061/dryad.sj3tx96gd

Description of the data and file structure

The data in this repo correspond to a genetic analysis of Diversity Outbred (DO) mice conducted using internal data collected by Calico Life Sciences LLC and publicly available DO mouse datasets. This involved unifying several previously generated datasets, each of which involved distinct experimental efforts, for mega-analysis.

Files and variables

File: Table_S1_DO_Phenotypes_normalized.csv

Description: Contains Mouse IDs, covariates used in phenotype processing, and z-score normalized phenotype values. ‘Mouse_ID’ is a unique identifier corresponding to each unique animal and includes study information. ‘Age’ is a covariate corresponding to the age of each animal at the time of phenotyping. For certain animals, different phenotypes were collected at different ages; for these animals, ‘Age’ is recorded as “variable”. ‘Sex’ corresponds to the sex of the animal (‘F’ for female and ‘M’ for male). ‘Diet’ corresponds to the dietary intervention group for each animal in the DRiDO study (Di Francesco et al. 2024), but may also correspond to the number of days on a high fat diet (animals from the Svenson high-fat diet study, denoted as ‘Sven_HFD’ (Gatti et al. 2017)) or pharmacological intervention (e.g., lifespan data from (Mullis et al. 2025)). ‘Genwave’ corresponds to the DO generation of each animal. Additional columns correspond to various z-score normalized phenotypes after correction for the aforementioned covariates. Phenotype names contain a prefix corresponding to the study from which they were collected: ‘attie’ (pancreatic phenotypes from (Keller et al. 2018)), ‘bone’ (bone strength phenotypes from (Al-Barghouthi et al. 2021)), ‘cal_cr’ (DRiDO phenotypes from (Di Francesco et al. 2024)), ‘cal_int’ (newly published frailty, aorta, and lung phenotypes), ‘catnap’ (metabolic cage data from (Z. Chen et al. 2022)), ‘ches_striatum’ (phenotypes from (Bagley et al. 2022)), ‘lifespan’ (lifespan data from independent lifespan studies analyzed in (Mullis et al. 2025); ‘shock’ and ‘har’ correspond to the Shock and Harrison studies, respectively), ‘pazdro_heart’ (cardiac phenotypes from (Starcher et al. 2021)), and ‘Sven_HFD’ (various phenotypes from (Gatti et al. 2017)). 
Phenotypes may also contain a suffix corresponding to dietary intervention groups from the DRiDO study: ‘AL’ (ad libitum), ‘20’ (20% caloric restriction), ‘40’ (40% caloric restriction), ‘1D’ (one-day intermittent fasting), ‘2D’ (two-day intermittent fasting). DRiDO phenotypes collected annually or biannually carry a timepoint label for the study year in which they were collected (‘Y1’ - ‘Y4’). Biannual measurements are additionally labeled ‘A’ or ‘B’, corresponding to the first or second time point within that year.

Variables

Mouse_ID: a unique identifier for each animal; it also encodes study information.
Age: a covariate corresponding to the age of each animal at the time of phenotyping in days. For certain animals, different phenotypes were collected at different ages; for these animals, ‘Age’ is recorded as “variable”.
Sex: the sex of the animal (‘F’ for female and ‘M’ for male).
Diet: the dietary intervention group for each animal in the DRiDO study (Di Francesco et al. 2024), but may also correspond to the number of days on a high fat diet (animals from the Svenson high-fat diet study, denoted as ‘Sven_HFD’ (Gatti et al. 2017)) or pharmacological intervention (e.g., lifespan data from (Mullis et al. 2025)).
Genwave: the DO generation of each animal.
Additional columns: various z-score normalized phenotypes after correction for the aforementioned covariates. See above for details on phenotype prefixes and suffixes.
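The prefix convention can be used to group phenotype columns by source study. The following is a minimal base-R sketch, not part of the deposited code. It uses a small synthetic table in place of Table_S1_DO_Phenotypes_normalized.csv; the mouse IDs, the phenotype column names, and the helper study_prefix() are hypothetical, and only the prefixes themselves are taken from the description above. It also assumes that ‘Age’ is read as text because some entries are “variable”.

```r
# Study prefixes listed in the file description
prefixes <- c("attie", "bone", "cal_cr", "cal_int", "catnap",
              "ches_striatum", "lifespan", "pazdro_heart", "Sven_HFD")

# Hypothetical helper: return the study prefix of each phenotype name,
# or NA when none of the listed prefixes matches.
study_prefix <- function(trait_names) {
  vapply(trait_names, function(nm) {
    hit <- prefixes[startsWith(nm, paste0(prefixes, "_"))]
    if (length(hit) == 0) NA_character_ else hit[which.max(nchar(hit))]
  }, character(1), USE.NAMES = FALSE)
}

# Synthetic stand-in for the phenotype table (IDs and trait names are made up)
pheno <- data.frame(
  Mouse_ID = c("m1", "m2", "m3"),
  Age = c("365", "variable", "540"),
  Sex = c("F", "M", "F"),
  cal_cr_example_trait_AL = c(0.1, -1.2, 0.4),
  bone_example_trait = c(NA, 0.3, -0.7),
  stringsAsFactors = FALSE
)

# Everything that is not a documented covariate is treated as a phenotype
covariates <- c("Mouse_ID", "Age", "Sex", "Diet", "Genwave")
trait_cols <- setdiff(names(pheno), covariates)
by_study <- split(trait_cols, study_prefix(trait_cols))

# "variable" ages become NA when converted to numeric days
age_days <- suppressWarnings(as.numeric(pheno$Age))
```

Here by_study is a named list with one element per study prefix found among the columns, which can then be used to subset the table study by study.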

File: Table_S2_Study_summary.xlsx

Description: Data sources for DO mouse phenotypes. Contains information on the studies from which phenotypes were obtained. This includes the number of animals used and phenotypes measured, the general focus of each study, the source of the phenotype and genotype data, the genotyping array used, and the prefix assigned to corresponding Mouse_ID and phenotype labels.

Variables

Study: the study from which data was collected. Corresponds to how each study is referred to in the accompanying manuscript.
No. mice: the number of mice used in the study.
No. phenotypes: the number of phenotypes collected in the study.
Focus: the focus of the study, corresponding to broad aspects of physiology reflected in the measured phenotypes.
Source: the source of the data from the study. Can be an online database or a paper. 'N/A' means that the data was not previously available.
Genotype array: the MUGA genotype array used to genotype the animals.
Data prefix: the prefix used to tag phenotypes from each study. Corresponds to the phenotype prefixes used in Table_S1.

File: Table_S3_h2_all_traits.txt

Description: Heritability (h²) of individual traits. Individual heritability estimates for each of the phenotypes in the dataset, before filtering based on h².

Variables

Trait: a phenotype from Table_S1.
h2: the heritability of the trait as estimated by Haseman-Elston regression.
SE: the standard error of the h2 estimate.
p: the p-value of the h2 estimate.
n: the number of animals used in the estimate.

File: Table_D1_rg_filtered.RData

Description: Pairwise genetic correlation (r_g) estimates for each of the phenotypes in the dataset, after filtering based on h².

Variables

Trait1: the first trait.
Trait2: the second trait.
rg: the genetic correlation between the two traits. 'NA' values may occur if the heritability estimate for one or more of the traits is negative.
SE: the standard error of the genetic correlation estimate.
rg_p: the p-value associated with the Haseman-Elston (HE) regression coefficient.
r: the Pearson correlation coefficient of the two traits.
r_p: the p-value of the Pearson correlation coefficient.
h2_1: the heritability of the first trait via HE regression.
h2_2: the heritability of the second trait via HE regression.
n_1: the number of animals in which the first trait was measured.
n_2: the number of animals in which the second trait was measured.
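For clustering or plotting, the pairwise table is often easier to handle as a symmetric trait-by-trait matrix. The sketch below shows one way to reshape it in base R. It is illustrative only: the data frame rg_long is a synthetic stand-in using the column names documented above, the trait names are hypothetical, and setting the diagonal to 1 is an assumption of this example rather than something stored in the file.

```r
# Synthetic stand-in for the pairwise table (documented column names)
rg_long <- data.frame(
  Trait1 = c("trait_a", "trait_a", "trait_b"),
  Trait2 = c("trait_b", "trait_c", "trait_c"),
  rg     = c(0.62, NA, -0.15),   # NA: rg not estimable for this pair
  stringsAsFactors = FALSE
)

# Empty trait-by-trait matrix with named rows and columns
traits <- sort(unique(c(rg_long$Trait1, rg_long$Trait2)))
rg_mat <- matrix(NA_real_, length(traits), length(traits),
                 dimnames = list(traits, traits))
diag(rg_mat) <- 1   # assumption: a trait is perfectly correlated with itself

# Fill both triangles so the matrix is symmetric
idx <- cbind(rg_long$Trait1, rg_long$Trait2)
rg_mat[idx] <- rg_long$rg
rg_mat[idx[, 2:1, drop = FALSE]] <- rg_long$rg
```

Pairs with an 'NA' genetic correlation stay 'NA' in the matrix, so downstream clustering needs to decide how to handle them.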

File: Table_S4_rg_matrix_clustered.RData

Description: Hierarchically clustered matrix of genetic correlations (r_g). Row and column names correspond to the DO phenome (see Table_S1) after filtering by h². This matrix includes DRiDO phenotypes split by diet group.

File: Table_S5_rpg_matrix_clustered.RData

Description: Hierarchically clustered matrix of Pearson genetic correlations (r_pg). Row and column names correspond to the DO phenome (see Table_S1) after filtering by h². This matrix includes DRiDO phenotypes split by diet group.

File: Table_S6_trait_groups.txt

Description: Meta-trait composition. Filtered by h² and excluding diet-specific DRiDO phenotypes. The first column (no column name) is an index.

Variables

Trait: corresponds to one of the phenotypes in Table S1.
Cluster: specifies the meta-trait to which each trait belongs.

File: Table_S7_meta_phenotypes.txt

Description: Meta-traits constructed from clustered individual phenotypes. Animal IDs are listed in the ‘Mouse_ID’ column, with each subsequent column corresponding to a meta-trait.

Variables

MouseID: The animal ID.
Additional columns: correspond to meta-traits constructed from clusters of genetically correlated phenotypes. If none of the traits comprising a meta-trait were measured in a particular animal, the trait value for that animal is ‘NA’.

File: Table_S8_h2_meta_traits.txt

Description: Heritability estimates for each of the meta-phenotypes in the dataset, excluding meta-traits composed of lifespan data from the Ellison study (prefixed with “lifespan_ell”, which were not significantly heritable and excluded from analysis).

Variables

Trait: a meta-trait, which is composed of individual phenotypes listed in Table S6.
h2: the heritability of the meta-trait as estimated by Haseman-Elston regression.
SE: the standard error of the h2 estimate.
p: the p-value of the h2 estimate.
n: the number of animals used in the estimate.

File: Table_S9_rg_meta_traits.txt

Description: Pairwise genetic correlation estimates for each of the meta-phenotypes in the dataset, excluding meta-traits composed of lifespan data from the Ellison study (prefixed with “lifespan_ell”, which were not significantly heritable and excluded from analysis).

Variables

Trait1: the first trait for which the genetic correlation (rg) is estimated.
Trait2: the second trait for which the rg is estimated.
rg: the genetic correlation estimate of Trait1 and Trait2. 'NA' values may occur if the heritability estimate for one or more of the traits is negative.
SE: the standard error of the estimate.
rg_p: the p-value associated with the Haseman-Elston regression.
r: the Pearson correlation coefficient of Trait1 and Trait2.
r_p: the p-value of the Pearson correlation.
h2_1: the heritability (h2) estimate for Trait1.
h2_2: the heritability (h2) estimate for Trait2.
n_1: the number of animals for which Trait1 was measured.
n_2: the number of animals for which Trait2 was measured.

File: Table_S10_clustered_meta_traits_rg.txt

Description: Hierarchically clustered matrix of genetic correlations among meta-traits. Row and column names correspond to the meta-trait after removing two meta-traits assembled from low h² lifespan estimates.

File: Table_S11_clustered_meta_traits_rpg.txt

Description: Hierarchically clustered matrix of Pearson genetic correlations among meta-traits. Row and column names correspond to the meta-trait after removing two meta-traits assembled from low h² lifespan estimates.

File: Table_S12_qtl_alpha05.txt

Description: QTL detected in meta-traits at permutation-based significance levels.

Variables

Trait: meta-trait associated with the locus.
Chr: chromosome on which the QTL was detected.
Pos: position info for the QTL.
LOD: the logarithm of the odds (LOD) score of the QTL.
CI_start: the starting position of the 2LOD support interval.
CI_end: the ending position of the 2LOD support interval.

File: Table_S13_qtl_LOD6.txt

Description: QTL detected in meta-traits at LOD ≥ 6.

Variables

Trait: meta-trait associated with the locus.
Chr: chromosome on which the QTL was detected.
Pos: position info for the QTL.
LOD: the logarithm of the odds (LOD) score of the QTL.
CI_start: the starting position of the 2LOD support interval.
CI_end: the ending position of the 2LOD support interval.

File: Table_S14_fdr_by_threshold_all_meta_traits.txt

Description: False discovery rates (FDRs) for each phenome-wide meta-trait at LOD ≥ 6 and ⍺ = 0.05 significance thresholds.

Variables

Trait: the meta-trait for which FDR was assessed.
LOD6_FDR: the FDR at a significance threshold of LOD ≥ 6.
alpha_05_FDR: the FDR at a permutation-based significance threshold of ⍺ = 0.05.

File: Table_S15_frailty_clusters.txt

Description: Frailty meta-trait composition. Lists each of the frailty phenotypes after filtering by h².

Variables

Trait: corresponds to one of the phenotypes in Table_S1.
Cluster: specifies the frailty meta-trait to which each trait belongs.

File: Table_S16_aorta_clusters.txt

Description: Aorta meta-trait composition. Lists each of the aorta phenotypes after filtering by h².

Variables

Trait: corresponds to one of the phenotypes in Table S1. Trait suffixes correspond to the type of stain from which the measurement was derived: H&E (‘he’), trichrome (‘_tc’), or Verhoeff-Van Gieson (VVG; ‘_vvg’).
Cluster: specifies the aorta meta-trait to which each trait belongs.

File: Table_S17_fdr_by_threshold_aorta.txt

Description: False discovery rates (FDRs) for each aorta meta-trait at LOD ≥ 6 and ⍺ = 0.05 significance thresholds.

Variables

Trait: the meta-trait for which FDR was assessed.
LOD6_FDR: the FDR at a significance threshold of LOD ≥ 6.
alpha_05_FDR: the FDR at a permutation-based significance threshold of ⍺ = 0.05.

File: Table_S18_aorta_qtl.txt

Description: QTL detected in aorta mega-analysis at LOD ≥ 6.

Variables

Cluster: the ID of the aorta meta-trait.
Trait: the name of the meta-trait associated with the locus.
Chr: chromosome.
Pos: position of the peak marker.
LOD: LOD score.
CI_start: the starting position of the 2LOD support interval.
CI_end: the ending position of the 2LOD support interval.

File: Table_S19_lung_clusters.txt

Description: Lung meta-trait composition. Lists each of the lung phenotypes after filtering by h².

Variables

Trait: corresponds to one of the phenotypes in Table_S1. As in Table S16, trait suffixes correspond to the type of stain from which a phenotypic measurement was derived, when present.
Cluster: specifies the lung meta-trait to which each trait belongs.

File: Table_S20_lung_qtl.txt

Description: QTL detected in lung mega-analysis at LOD ≥ 6.

Variables

Cluster: the ID of the lung meta-trait.
Trait: the name of the meta-trait associated with the locus.
Chr: chromosome.
Pos: position of the peak marker.
LOD: LOD score.
CI_start: the starting position of the 2LOD support interval.
CI_end: the ending position of the 2LOD support interval.

File: Table_S21_fdr_by_threshold_lung.txt

Description: False discovery rates (FDRs) for each lung meta-trait at LOD ≥ 6 and ⍺ = 0.05 significance thresholds.

Variables

Trait: the meta-trait for which FDR was assessed.
LOD6_FDR: the FDR at a significance threshold of LOD ≥ 6.
alpha_05_FDR: the FDR at a permutation-based significance threshold of ⍺ = 0.05.

File: Data_D1_all_8state_69k.RData

Description: Haplotype probability information for the animals included in this study at ~69k genetic pseudomarkers. A list of three-dimensional matrices, with dimensions corresponding to animal, 8-state founder allele probability, and marker. The R object is titled 'apr' by default.

List structure

A list of 20 three-dimensional matrices, the first 19 of which correspond to each mouse autosome. The 20th matrix corresponds to the X chromosome. Names of the matrices can be accessed via names(apr) in R.

Matrix structure

For each matrix, there are three dimensions.

Animal: the animals in the dataset (e.g. apr[[1]][1,,] returns a matrix of 8-state allele probabilities at all markers for the first mouse on chromosome 1).
8-state allele probability: a vector of 8 probabilities corresponding to the DO founder strains (e.g. apr[[1]][,1,] returns a matrix of probabilities for the first founder allele at all markers and for all mice on chromosome 1). Each probability is the chance that the animal carries a particular founder strain's allele at a marker. The names along this dimension are A-H, each corresponding to a particular founder strain: A: A/J, B: C57BL/6J, C: 129S1/SvImJ, D: NOD/ShiLtJ, E: NZO/HlLtJ, F: CAST/EiJ, G: PWK/PhJ, H: WSB/EiJ. See https://www.jax.org/research-and-faculty/genetic-diversity-initiative/tools-data/diversity-outbred-reference-data for details.
Marker: a genetic pseudomarker (interpolated marker data) at a particular location on the chromosome (e.g. apr[[1]][,,1] returns a matrix of 8-state allele probabilities for all mice at the first marker on chromosome 1).

Example

apr[[3]][1,5,8] would return a single value: the allele probability of the fifth founder allele (NZO/HlLtJ) at the 8th pseudomarker on chromosome 3 for the first mouse in the dataset.
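The indexing pattern can be tried without loading the full file. The sketch below builds a small synthetic list with the same layout (animal x founder x marker) and repeats the lookups described above. The mouse names, marker names, array sizes, and the helper make_chr() are hypothetical; the real object has 20 elements and roughly 69k pseudomarkers in total.

```r
set.seed(1)
founders <- LETTERS[1:8]   # A-H, as in the founder key above
n_mice <- 3
n_markers <- 10

# Hypothetical helper: one synthetic chromosome, animal x founder x marker
make_chr <- function() {
  raw <- array(runif(n_mice * 8 * n_markers),
               dim = c(n_mice, 8, n_markers),
               dimnames = list(paste0("mouse", seq_len(n_mice)),
                               founders,
                               paste0("marker", seq_len(n_markers))))
  # Rescale so the 8 founder probabilities sum to 1 per mouse and marker
  totals <- apply(raw, c(1, 3), sum)
  sweep(raw, c(1, 3), totals, "/")
}

# Synthetic stand-in for 'apr' with three chromosomes
apr <- lapply(1:3, function(i) make_chr())

one_mouse <- apr[[3]][1, , ]     # founders x markers for the first mouse
one_value <- apr[[3]][1, 5, 8]   # founder E, 8th marker, first mouse
```

Because each slice keeps its dimension names, the same value can also be read by name, for example one_mouse["E", 8].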

File: Data_D2_kinship_loco.Rdata

Description: A list of 20 kinship matrices generated via the ‘qtl2’ package in R. Each matrix corresponds to a chromosome, with the 20th matrix corresponding to the X chromosome. Rows and columns in the matrices correspond to animals in the dataset.

File: Data_S1_69k_grid_pgmap.rdata

Description: Pseudomarkers used to interpolate genotype data across studies. This is loaded into R as both a genetic map (named 'gmap' by default) and a physical map (named 'pmap' by default). Each map is a list of 20 vectors corresponding to chromosomes 1-19 and the X chromosome, respectively. Values in each vector correspond to the genetic position in centimorgans ('gmap') or physical position in megabases ('pmap'). The names of each vector correspond to the name of the pseudomarker using the naming convention "chromosome_position", where position is the position of the pseudomarker in bases.
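Since pseudomarker names encode the position in bases while 'pmap' stores megabases, it can be useful to split a name into its parts and convert units. A minimal base-R sketch follows; the marker names are synthetic examples that follow the stated "chromosome_position" convention and are not taken from the file.

```r
# Synthetic marker names following "chromosome_position" (position in bases)
marker_names <- c("1_3000000", "1_3050000", "X_5250000")

# Split each name at the underscore and convert bases to megabases
parts <- strsplit(marker_names, "_", fixed = TRUE)
marker_tbl <- data.frame(
  marker = marker_names,
  chr    = vapply(parts, `[`, character(1), 1),
  pos_mb = as.numeric(vapply(parts, `[`, character(1), 2)) / 1e6,
  stringsAsFactors = FALSE
)
```

The resulting pos_mb column is on the same scale as the values stored in 'pmap'.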

File: Data_D3_meta_trait_gwas.zip

Description: Genome-wide scan data for meta-traits. This .zip file contains a directory titled 'gwas', which contains 383 .Rdata files, each of which encodes a single-column qtl2::scan1() object titled 'scan' that contains GWAS results for a meta-trait in the dataset.

The values in the 'scan' object are -log10(p) values for each of the ~69k pseudomarkers with rownames corresponding to the markers. The column name corresponds to the meta-trait that was analyzed.
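As a rough illustration of how a 'scan' object might be used once loaded, the sketch below converts the stored -log10(p) values back to p-values and finds the top pseudomarker. The matrix here is a synthetic stand-in with the documented shape (one column, markers as row names); the marker names, the meta-trait name, and the values are made up.

```r
# Synthetic stand-in for 'scan': one column of -log10(p) values,
# row names are pseudomarkers, column name is the meta-trait
scan <- matrix(c(0.4, 2.1, 7.3, 1.0), ncol = 1,
               dimnames = list(c("1_3000000", "1_3050000",
                                 "2_4000000", "X_5250000"),
                               "example_meta_trait"))

# Back-transform to p-values (named by marker)
p_values <- 10^(-scan[, 1])

# Marker with the strongest association
peak_marker <- rownames(scan)[which.max(scan[, 1])]
```

With the real files, calling load() on one of the .Rdata files in the 'gwas' directory should place an object named 'scan' in the workspace, after which the last two statements apply unchanged.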

File: Data_D4_variant_association_and_gene_annotations.zip

Description: Variant association mapping for each phenome-wide meta-trait QTL, and lists of genome annotations within the confidence intervals of the QTL. The .zip file contains a directory titled 'finemapping', which contains ~1,700 files, each of which contains variant association data for a QTL detected in the GWAS of a meta-trait in our dataset.

There are two files for each QTL: a variant association mapping file and a genes file. The variant association mapping files use the naming convention "meta_trait_k_385_cluster_X_chrY_posZ", where X is the meta-trait, Y is the chromosome, and Z is the position of the sentinel pseudomarker underlying the QTL. The genes files follow the same convention but have the suffix "_genes" at the end of the filename.
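With roughly 1,700 files in the directory, it may help to parse the file names into a lookup table. The sketch below is one possible approach in base R. The helper parse_qtl_file() is hypothetical, the example names are invented to match the stated convention, and the code assumes that any file extension has already been removed; the exact format of the position field is not documented here, so it is kept as text.

```r
# Hypothetical helper: split a file name (without extension) into its parts
parse_qtl_file <- function(stem) {
  is_genes <- endsWith(stem, "_genes")
  core <- sub("_genes$", "", stem)           # drop the genes-file suffix
  pattern <- "^meta_trait_k_385_cluster_(.+)_chr([^_]+)_pos(.+)$"
  m <- regmatches(core, regexec(pattern, core))
  # Return the i-th captured field, or NA when the name does not match
  pick <- function(i) {
    vapply(m, function(x) if (length(x) == 4) x[i] else NA_character_,
           character(1))
  }
  data.frame(stem = stem, cluster = pick(2), chr = pick(3), pos = pick(4),
             is_genes_file = is_genes, stringsAsFactors = FALSE)
}

# Invented example names that follow the stated convention
stems <- c("meta_trait_k_385_cluster_12_chr7_pos45.2",
           "meta_trait_k_385_cluster_12_chr7_pos45.2_genes")
parsed <- parse_qtl_file(stems)
```

Rows that share cluster, chr, and pos but differ in is_genes_file identify the two files belonging to the same QTL.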

Variant association files:

Description: Dataframes containing variant association mapping for a particular QTL.

Variables:

snp: the name of the variant; contains chromosome, position, and nucleotide information.
LOD: the LOD score of the variant.
chr: the chromosome on which the variant is located.
pos: the physical position of the variant in megabases.
alleles: the alleles/nucleotides present at the variant.
sdp: the strain distribution pattern of the alleles. This is a number/factor that encodes which of the 8 founder strains contains the major/minor allele.
ensembl_gene: the Ensembl ID of the gene that a variant is present within. A blank value indicates that the variant is not within a gene.
consequence: the function annotation associated with the variant.
A_J - WSB_EiJ: factors specifying which allele is present in each of the founder strains.
type: the type of variant. Can be "snp" (single nucleotide polymorphism), "indel" (insertion/deletion), or "SV" (structural variant, meaning a larger-scale genomic rearrangement).
on_map: TRUE/FALSE specifying whether the variant is in the physical/genetic map or not. Data for variants not on the map is imputed from genotype data at pseudomarkers during variant association.

Gene files:

Description: A dataframe of genes within the region used for variant association mapping. Each row corresponds to a gene.

Variables:

chromosome: the chromosome on which the gene is located.
source: the database used to access gene information.
type: the type of annotation. Can be either "gene" or "pseudogene".
start: the start position of the gene in megabases.
end: the ending position of the gene in megabases.
score: not used; always 'NA'.
strand: the strand of DNA on which the gene is encoded. Can be "+" or "-".
phase: not used; always 'NA'.
ID: the ID of the gene, taken from the 'source' database. Includes data on the database and the reference genome that was used.
Name: the name of the gene.
Parent: not used; always 'NA'.
Dbxref: databases that were cross-referenced for gene data.
gene_id: the ID of the gene, taken from the 'source' database. Only includes the source database and gene ID number (doesn't include ref genome information).
mgi_type: the gene annotation in the source database.
description: a functional description of the gene.

File: Data_S2_Aorta_Meta_Analyses.RData

Description: Genome-wide scan data for aorta meta-traits. See description of Data_D3_meta_trait_gwas.zip.

File: Data_S3_Aorta_Variant_Association.zip

Description: Variant association mapping for each aorta meta-trait QTL, and lists of genome annotations within the confidence intervals of the QTL. See description of Data_D4_variant_association_and_gene_annotations.zip.

File: Data_S4_Lung_Meta_Analyses.RData

Description: Genome-wide scan data for lung meta-traits. See description of Data_D3_meta_trait_gwas.zip.

File: Data_S5_Lung_Variant_Association.zip

Description: Variant association mapping for each lung meta-trait QTL, and lists of genome annotations within the confidence intervals of the QTL. See description of Data_D4_variant_association_and_gene_annotations.zip.

Code/software

Analysis was conducted using R/qtl2 version 0.36.

Functions used to estimate h² and r_g are available at: https://github.com/calico/HE-regression

Change log

Second version: updates to file names. No data changed.

Genetic correlation-guided mega-analysis of DO mice provides mechanistic insight and candidate genes for age-related pathologies
