Data from: Forest tree breeding using genomic Markov causal models: A new approach to genomic tree breeding improvement

Jurcic, Esteban J.1 ; Dutour, Joaquín2 ; Villalba, Pamela V.3 ; Centurión, Carmelo2 ; Cantet, Rodolfo J. C.4 ; Munilla, Sebastián5 ; Cappa, Eduardo P.1

Published Mar 11, 2025 on Dryad. https://doi.org/10.5061/dryad.pzgmsbczh

Data files

Mar 11, 2025 version files (44.19 MB total):

Phenotypic_and_pedigree_file.txt (138.58 KB)

README.md (3.73 KB)

SNP_file.fwf (44.05 MB)

Abstract

Traditionally, a pedigree-based individual-tree mixed model (ABLUP) has been used in forest genetic evaluations to identify individuals with the highest breeding values (BVs). ABLUP is a Markovian causal model, as any individual BV can be expressed as a linear regression on its parental BVs. The regression coefficients are based on the genealogical parent-offspring relationship and are equal to one-half. This study aimed to develop and apply two new causal models that replace these fixed coefficients with ones calculated using genomic information, specifically derived from the genomic-based relationship matrix. We compared the performance of these genomic-based causal models with ABLUP and non-causal GBLUP models. To do so, we evaluated a four-generation population of Eucalyptus grandis, consisting of 3,082 genotyped trees with 14,033 single nucleotide polymorphism markers. Six traits were assessed in 1,219 trees across the first three breeding cycles. The heritability and genetic means estimates were higher in the causal pedigree- and genomic-based models compared to GBLUP. Realized genetic gains were similar across all models, but the causal models more closely matched the predicted gains than GBLUP. In turn, GBLUP demonstrated better predictive performance, albeit with lower precision. The causal models developed in this study enable discerning intra-familial variations in the predictions of BVs at a lower computational burden and offer a potential alternative to the GBLUP model.

Description of the data and file structure

GENERAL INFORMATION

1. Title of Dataset: Forest tree breeding using genomic Markov causal models: A new approach to genomic tree breeding improvement

2. Author Information

A. Principal Investigator Contact Information

Name: Esteban Javier Jurcic

Institution: Instituto Nacional de Tecnología Agropecuaria (INTA)

Address: De Los Reseros y Dr. Nicolás Repetto s/n, 1686, Hurlingham, Buenos Aires, Argentina.

Email: jurcic.esteban@inta.gob.ar

B. Associate or Co-investigator Contact Information

Name: Eduardo Pablo Cappa

Institution: Instituto Nacional de Tecnología Agropecuaria (INTA) - CONICET

Address: De Los Reseros y Dr. Nicolás Repetto s/n, 1686, Hurlingham, Buenos Aires, Argentina.

Email: cappa.eduardo@inta.gob.ar

Information about funding sources that supported the collection of the data: This research was supported by UPM-Forestal Oriental S.A.

DATA & FILE OVERVIEW

1. File List:

Phenotypic_and_pedigree_file.txt: tree information

SNP_file.fwf: marker information

2. Relationship between files, if important: the self column (tree ID) links the records in Phenotypic_and_pedigree_file.txt to those in SNP_file.fwf

3. Additional related data collected that was not included in the current data package: -

4. Are there multiple versions of the dataset? no

People involved with sample collection, processing, analysis and/or submission: Joaquín Dutour, Alexandra Simonov, Robert Silvestre, Esteban J. Jurcic, Eduardo P. Cappa
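The link between the two files through the self column can be sketched as follows. This is a minimal Python illustration, not part of the deposited material: it assumes both files have already been read into lists of dictionaries keyed by column name, and the function name `join_by_tree_id`, the `markers` key, and the toy records are hypothetical.

```python
def join_by_tree_id(pheno_rows, snp_rows):
    """Inner-join phenotype/pedigree records with genotype records on `self`."""
    # Index the genotype records by tree ID for constant-time lookup.
    genotypes = {row["self"]: row["markers"] for row in snp_rows}
    joined = []
    for row in pheno_rows:
        tree_id = row["self"]
        if tree_id in genotypes:
            # Keep every phenotype/pedigree field and attach the marker string.
            joined.append({**row, "markers": genotypes[tree_id]})
    return joined


# Toy records, invented for illustration only (not taken from the dataset).
pheno = [
    {"self": "101", "DAD": "1", "MUM": "2"},
    {"self": "102", "DAD": "1", "MUM": "3"},
]
snps = [
    {"self": "102", "markers": "0125"},
    {"self": "103", "markers": "2210"},
]
matched = join_by_tree_id(pheno, snps)
```

Only trees present in both inputs are kept; an outer join would be needed to retain trees that appear in just one file.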

Files and variables

DATA-SPECIFIC INFORMATION FOR: Phenotypic_and_pedigree_file.txt

1. Number of variables: 11

2. Number of cases/rows: 3082

3. Variable List:

self: tree ID

DAD: dad ID

MUM: mum ID

Trial: trial site, one of Tres bocas (TB), Pandule (PA), Young (YO), Gallinal (GA) or Greenhouse (GH)

DBH (cm): diameter at breast height measured in centimeters

HT (m): total tree height measured in meters

PY (%): pulp yield expressed as a percentage

LIG (%): lignin expressed as a percentage

CEL (%): cellulose expressed as a percentage

WD (kg/m³): wood density measured in kilograms per cubic meter

4. Missing data codes: 0
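Because 0 is the missing data code, trait values equal to 0 should be masked before any averaging or model fitting. The Python sketch below shows one way to do that. It assumes the code applies to the six trait columns and that records are dictionaries keyed by the short trait names; the actual header labels in the file may differ, and `mask_missing` and the toy record are hypothetical.

```python
TRAITS = ("DBH", "HT", "PY", "LIG", "CEL", "WD")


def mask_missing(record, traits=TRAITS, missing_code=0.0):
    """Return a copy of `record` with missing trait values replaced by None."""
    cleaned = dict(record)
    for trait in traits:
        value = float(record[trait])
        # A value equal to the missing code means "not measured", not a true zero.
        cleaned[trait] = None if value == missing_code else value
    return cleaned


# Toy record, invented for illustration only (not taken from the dataset).
tree = mask_missing(
    {"self": "101", "DBH": "23.5", "HT": "0", "PY": "51.2",
     "LIG": "0", "CEL": "44.0", "WD": "480"}
)
```

Non-trait fields such as self are passed through unchanged.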

DATA-SPECIFIC INFORMATION FOR: SNP_file.fwf

1. Number of variables: 2

2. Number of cases/rows: 3082

3. Variable List:

Column 1: self (tree ID)

Column 2: 14286 SNP markers

4. Missing data codes: 5
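A line of the SNP file could be parsed along the following lines. This Python sketch rests on assumptions the description does not confirm: that the tree ID and the marker field are separated by whitespace, and that the marker field is a contiguous string with one single-digit code per marker. The function name `parse_snp_line` and the sample line are hypothetical.

```python
def parse_snp_line(line, missing_code=5):
    """Split one record into (tree ID, list of genotype codes or None)."""
    # Assumed layout: "<tree ID><spaces><one digit per marker>".
    tree_id, marker_string = line.split()
    genotypes = [
        None if int(code) == missing_code else int(code)
        for code in marker_string
    ]
    return tree_id, genotypes


# Toy line, invented for illustration only (not taken from the dataset).
tree_id, genotypes = parse_snp_line("102   01252")
```

If the file instead uses strict fixed column offsets with embedded spaces, the `split()` call would need to be replaced by slicing at the documented column positions.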

Code/software

Title: Relationship matrices (and their inverses) corresponding to the new genomic causal PARBLUP and PARBLUP_sm models: an example based on Figure 1 of the manuscript.

Author: Esteban J. Jurcic (jurcic.esteban@inta.gob.ar)

Date: "February 19th, 2025"

NOTE: This R script is only valid for populations whose parents are unrelated by pedigree, which is the most common situation in forest improvement populations.

Data files: To build the relationship matrix corresponding to the PARBLUP and PARBLUP_sm models (and their inverses), pedigree information (data) is required. The data should include the columns suc, dad, mum, and var, which indicate the individual ID, father, mother, and type of variable (exogenous or endogenous), respectively. Additionally, the genomic relationship matrix (G) is required, and it should be in the same order as the data file. The files used in this analysis are based on the pedigree shown in Figure 1 (U=1, V=2, W=3, X=4, Y=5, and Z=6).
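For orientation, the pedigree-based (ABLUP) case that these models generalize can be sketched in a few lines of Python. This is not the authors' R script and not the PARBLUP computation: it uses the fixed coefficient of one-half mentioned in the abstract, whereas the genomic models replace that coefficient with values derived from G. The sketch assumes parents are listed before offspring, unknown parents are coded 0, and parents are unrelated and non-inbred (as in the NOTE above). The function name and the toy pedigree are hypothetical and do not reproduce Figure 1.

```python
def pedigree_relationship(pedigree):
    """Build A = T D T' from rows of (suc, dad, mum), parents listed first."""
    index = {suc: k for k, (suc, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    # Row i of T expresses the BV of individual i as a weighted sum of
    # Mendelian sampling terms (its own and those of its ancestors).
    T = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for i, (suc, dad, mum) in enumerate(pedigree):
        T[i][i] = 1.0
        parents = [index[p] for p in (dad, mum) if p != 0]
        for p in parents:
            for j in range(i):
                # Markov step: each known parent contributes one-half.
                T[i][j] += 0.5 * T[p][j]
        # Residual variance: 1 for founders, 0.5 with two non-inbred parents.
        d[i] = 1.0 - 0.25 * len(parents)
    return [
        [sum(T[i][k] * d[k] * T[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical pedigree: three founders, one full cross each for trees 4 and 5.
toy_pedigree = [(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 1, 2), (5, 1, 3)]
A = pedigree_relationship(toy_pedigree)
```

In this toy case trees 4 and 5 share only parent 1, so they are half-sibs. Replacing the constant 0.5 and the residual variances with genomically derived quantities is the step the deposited R script addresses.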