
The heritability of size in a wild annual plant population with hierarchical size structure

Cite this dataset

Schoen, Daniel (2024). The heritability of size in a wild annual plant population with hierarchical size structure [Dataset]. Dryad.


The relative magnitude of additive genetic versus residual variation for fitness traits is important in models for predicting the rate of evolutionary change and population persistence in response to changes in the environment. In many annual plants, lifetime reproductive fitness is correlated with end-of-season plant biomass, which can vary significantly from plant to plant in the same population. We measured end-of-season plant biomasses and obtained SNP genotypes of plants in a dense, natural population of the annual plant species Impatiens capensis with hierarchical size structure. These data were used to estimate the amount of heritable variation for position in the size hierarchy and for plant biomass. Additive genetic variance for position in the size hierarchy and plant biomass were both significantly different from zero. These results are discussed in relationship to theory for the heritability of fitness in natural populations and ecological factors that potentially influence heritable variation for fitness in this species.


Title of Data Set: Heritability of fitness in a wild annual plant population with hierarchical size structure (Note: This data set was updated on 19 June 2024, after discovering several quality control issues with the genomic data. We apologize for any inconvenience.)

This data set includes measures of plant weight (fresh biomass in g) from a sample of 97 plants collected in three quadrats in the study population at random (random sample), and from 367 plants collected in five additional quadrats where large and small plants were sampled (extreme sample).

The 367 plants in the extreme sample were genotyped using genotyping-by-sequencing, and the VCF file containing their SNP genotypes is also included in this Dryad repository. The BAM files used for constructing this VCF file, along with the Impatiens capensis reference genome, are deposited in GenBank under SRA accession number PRJNA945897.

Description of the data and file structure


File: Impatiens_capensis_geneome_assembly_MCG3086_report.pdf   HiRise scaffolding report for reference genome.

File: Phenotypes_Random_sample_N=97.xlsx    Plant height (cm), Leaf number per plant, and plant weight (fresh biomass in g) from a sample of 97 plants of Impatiens capensis (random sample) in an Excel file.

File: Plant_weights_extreme_sample.txt  Plant weight (fresh biomass in g) from a sample of 367 plants (186 small and 181 large plants: see text of article) of Impatiens capensis (extreme sample) in a text file with 3 columns. The first two columns combined are plant identifiers and correspond to plant identifiers in the file “Impatiens_367_genotyped_individuals.vcf”.

File: Impatiens_367_genotyped_individuals.vcf  Genotypic data used along with data in the file “Plant_weights_extreme_sample.txt” to estimate the heritability of plant weight and position in the size hierarchy using the scripts in the file “LDAK_code”.

File: Position_map.txt Positions (quadrats) of each genotyped plant.


File: Simulation_Extreme_Sampling.R  R program to simulate extreme sampling to examine the potential bias in estimating heritability and its significance level.  Uses data from

File: LDAK_code  Code used with the LDAK software version 5.2 to estimate the heritability of plant weight and position in the size hierarchy.

File: Code for K-means analysis of plant weight data in random sample.

File: Code for quality control. See article.

File: Snakefile  Code for processing raw sequence reads and producing BAM files for input into STACKS. Used to automate the processing of the sequence data, including trimming adapters, creating sorted .bam files, and removing duplicates.

File: Linearmodel_wt_f(ht+lf_no).py  Used to analyze the relationship between individual plant biomass and leaf number per plant via linear model analysis.


Study population

The study population of Impatiens capensis is in Glen Sutton, Quebec, Canada (45° 02′ 37″ N, 72° 32′ 57″ W). The plants occur in damp soil within an irregularly shaped area of ca. 150 m², beneath a canopy of a mixed, mature deciduous-evergreen (Acer saccharum-Tsuga canadensis) forest. I. capensis plants in this population form nearly pure stands that emerge as a near continuous carpet of seedlings on the forest floor. The density of individuals remains high (ca. 200-250 per m²) at the end of the season. Early season seedling density was at least twice as high. In I. capensis, the Pearson correlation between chasmogamous seed production and end of season biomass is r = 0.95, and between overall seed production and biomass is r = 0.92 (Waller, 1979). Small plants have been shown to produce no chasmogamous flowers or fruit at all (Waller, 1979), thus making position in the size hierarchy an interesting fitness component for study.

Impatiens capensis reference genome

From a single plant collected in the study population, 10 g of young leaves were harvested and frozen in liquid nitrogen. Genomic DNA was extracted from this tissue and sequenced by Dovetail Genomics/Cantata Bio LLC (Scotts Valley, California). For each Dovetail Omni-C library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DNAse I, chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends (Jordan Zhang—Dovetail Genomics, pers comm.). After proximity ligation, crosslinks were reversed, and the DNA purified. Purified DNA was treated to remove biotin that was not internal to ligated fragments, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters (Jordan Zhang—Dovetail Genomics, pers comm.). Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The library was sequenced on an Illumina HiSeqX platform to produce approximately 30x sequence coverage.

The de novo assembly and the Dovetail OmniC library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al., 2016). Dovetail OmniC library sequences were aligned to the draft input assembly using bwa. The separations of Dovetail OmniC read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative mis-joins, to score prospective joins, and to make joins (Jordan Zhang, Dovetail Genomics, pers. comm.).

Plant biomass distribution in a random sample

At the end of the 2020 growing season for Impatiens capensis (15 September), we recorded plant height and number of leaves for 97 plants collected at random in three haphazardly placed 1 m² quadrats within the study population. All 97 plants were weighed fresh to the nearest 0.01 g, and linear regression was used to establish a predictive relationship for plant biomass based on plant height and leaf number per plant (r² = 0.72, P < 0.001).
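The regression of fresh weight on height and leaf number can be sketched as an ordinary least squares fit. This is a minimal illustration only: the function names and the five example plants are hypothetical, the example weights are constructed to follow an exact linear rule, and the code is not the repository's Linearmodel_wt_f(ht+lf_no).py script.

```python
def fit_linear_model(heights, leaf_counts, weights):
    """Least-squares fit of weight = b0 + b1*height + b2*leaf_count."""
    rows = [(1.0, float(h), float(l)) for h, l in zip(heights, leaf_counts)]
    k = 3
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, weights))]
        for i in range(k)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coefs = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(aug[i][j] * coefs[j] for j in range(i + 1, k))
        coefs[i] = (aug[i][k] - rest) / aug[i][i]
    return coefs


def r_squared(coefs, heights, leaf_counts, weights):
    """Proportion of variance in weight explained by the fitted model."""
    fitted = [coefs[0] + coefs[1] * h + coefs[2] * l
              for h, l in zip(heights, leaf_counts)]
    mean_w = sum(weights) / len(weights)
    ss_res = sum((w - f) ** 2 for w, f in zip(weights, fitted))
    ss_tot = sum((w - mean_w) ** 2 for w in weights)
    return 1.0 - ss_res / ss_tot


# Hypothetical plants: weight = 0.5 + 0.1*height + 0.2*leaves exactly.
heights = [10, 20, 30, 40, 50]          # cm
leaf_counts = [3, 5, 4, 8, 9]
weights = [2.1, 3.5, 4.3, 6.1, 7.3]     # g
coefs = fit_linear_model(heights, leaf_counts, weights)
fit = r_squared(coefs, heights, leaf_counts, weights)
```

With real field data the fit would not be exact; the reported r² of 0.72 indicates that height and leaf number explain most, but not all, of the variation in fresh weight.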

Size (biomass) inequality was examined by ranking plants from lightest weight to heaviest and graphing cumulative biomass against rank order and calculating the Gini coefficient, a measure of inequality of resource distribution (Dorfman, 1979). K-means clustering (Pedregosa et al., 2011) of the biomasses of 97 randomly sampled plants was used to determine the level of support for distinct plant size clusters, as well as the biomass cutoff that separates the clusters.
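The ranking and Gini calculation described above can be sketched as follows. The function names are hypothetical, and the rank-based formula used here is one standard expression for the Gini coefficient of a sample; it is not necessarily the exact form used in the study.

```python
def lorenz_curve(biomasses):
    """Cumulative share of total biomass, plants ranked lightest to heaviest."""
    ordered = sorted(biomasses)
    total = sum(ordered)
    shares, running = [], 0.0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


def gini_coefficient(biomasses):
    """Gini coefficient from ranked values: 0 means all plants weigh the same."""
    ordered = sorted(biomasses)
    n = len(ordered)
    total = sum(ordered)
    rank_weighted = sum(rank * value
                        for rank, value in enumerate(ordered, start=1))
    return 2.0 * rank_weighted / (n * total) - (n + 1.0) / n
```

For example, `gini_coefficient([1, 1, 1, 1])` is 0 (perfect equality), and a sample in which one of four plants holds all the biomass gives 0.75, the maximum for n = 4.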

Plant sampling for estimating the heritability of position in the plant size hierarchy and biomass

From five haphazardly placed 1 m² quadrats in the population, we sampled small plants close to ground level and plants in the upper part of the canopy of the population. This yielded an extreme sample of the smallest and largest plants (n = 186 and 181, respectively). The biomass cutoff separating small and large plants corresponded with that determined by K-means clustering of the random sample (see above).
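How a two-cluster K-means analysis yields a biomass cutoff can be illustrated with a small one-dimensional version of the algorithm. This is a stand-in written in plain Python, not the scikit-learn analysis used in the study; the function name, the initialisation at the minimum and maximum values, and the example weights are all assumptions made for illustration.

```python
def two_means_cutoff(biomasses, max_iter=100):
    """1-D K-means with k = 2; returns (cutoff, small_mean, large_mean).

    The cutoff is the midpoint between the two cluster means, which is
    the boundary at which a plant switches cluster membership.
    """
    low, high = min(biomasses), max(biomasses)   # initial cluster centres
    for _ in range(max_iter):
        boundary = (low + high) / 2.0
        small = [b for b in biomasses if b <= boundary]
        large = [b for b in biomasses if b > boundary]
        if not small or not large:               # all values identical
            break
        new_low = sum(small) / len(small)
        new_high = sum(large) / len(large)
        if new_low == low and new_high == high:  # converged
            break
        low, high = new_low, new_high
    return (low + high) / 2.0, low, high


# Hypothetical fresh weights (g) with two well separated size classes.
example_weights = [1.0, 1.2, 0.8, 9.0, 10.0, 11.0]
cutoff, small_mean, large_mean = two_means_cutoff(example_weights)
```

In this example the cluster means settle at about 1 g and 10 g, so the cutoff falls near 5.5 g.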

Genotyping-by-sequencing and population genetics analyses

Leaf tissue from the small and large sampled plants was collected and preserved in silica gel for DNA extraction and RAD sequencing. DNA from samples of the preserved leaves was extracted and processed for genotyping-by-sequencing (GBS) (Elshire et al., 2011) at the University of Wisconsin Biotechnology Center. Genomic DNA was digested with the restriction enzyme ApeKI and ligated to adapters and barcodes to create the GBS libraries. A NovaSeq6000 sequencer was used to obtain paired-end (150 bp) sequence reads from the libraries. The depth of coverage was approximately 180x.

Raw reads were demultiplexed and filtered for Illumina adapter sequences and PCR duplicates with process_radtags and clone_filter (STACKS version 2.60; Rochette, Rivera-Colón and Catchen, 2019). The demultiplexed and filtered reads were then aligned to the reference genome using the Burrows-Wheeler Alignment tool (Li and Durbin, 2009) as implemented in bwakit version 0.7.12 and converted to BAM files using SAMtools version 1.13 (Lin et al., 2009). SNPs with quality scores < 30 were identified with gstacks (STACKS version 2.60) and further processed with populations (STACKS version 2.60), which was used to filter out loci not found in at least 95% of individuals in each of the five quadrats sampled, and loci for which the minor allele frequency was < 0.01. The STACKS populations program was used for calculating population genetics statistics (nucleotide diversity and Wright’s Fst) and for reporting the results of SNP filtering. For the estimation of the genomic relationship matrix (GRM), we used vcfR (version 1.12; Knaus & Grünwald, 2017) and SNPfiltR (version 1.01; DeRaad, 2022) in R version 4.1.1 (R Core Team, 2021) to filter out SNPs with read depths < 7. To avoid spurious associations that might inflate relationship estimates in the GRM, SNPs with r² (squared correlation coefficient between the alleles at two loci) > 0.05 and within 1000 kb windows were filtered out with the LDAK 5.2 package (Speed et al., 2020).
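The logic of the two locus-level filters (presence in at least 95% of individuals in every quadrat, and a minor allele frequency of at least 0.01) can be sketched as below. This illustrates the filtering rule only and is not how STACKS implements it; the function names, the 0/1/2 genotype coding with None for a missing call, and the toy data are assumptions.

```python
def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele; genotypes are 0/1/2 counts or None."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2.0 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(genotypes, quadrats, min_call_rate=0.95, min_maf=0.01):
    """Keep a locus only if it is genotyped in enough plants per quadrat
    and its minor allele is not too rare."""
    calls_by_quadrat = {}
    for genotype, quadrat in zip(genotypes, quadrats):
        calls_by_quadrat.setdefault(quadrat, []).append(genotype)
    for calls in calls_by_quadrat.values():
        call_rate = sum(c is not None for c in calls) / len(calls)
        if call_rate < min_call_rate:
            return False
    return minor_allele_frequency(genotypes) >= min_maf


# Hypothetical loci scored in four plants from two quadrats.
quadrats = ["Q1", "Q1", "Q2", "Q2"]
kept = passes_filters([0, 1, 2, 1], quadrats)
dropped_missing = passes_filters([0, None, 2, 1], quadrats)
dropped_monomorphic = passes_filters([0, 0, 0, 0], quadrats)
```

The second locus fails because only half of the Q1 plants have a genotype call, and the third fails because it is monomorphic in the sample.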

Analyses of genetic variance and heritable variation for position in the size hierarchy and plant biomass

Our main interest was whether the distinct end-of-season presentation of large versus small size classes that is characteristic of this species (Schmitt et al., 1987) reflects the outcome of genetically based selection or is instead mainly due to environmental differences that determine the formation of the size hierarchy; in other words, whether position in the size hierarchy is heritable. In the first analysis we treated size as a threshold trait (small plants versus large plants) and analyzed it on a liability scale, as is done in some agricultural genetic studies for traits such as disease resistance and secondary compound content (Merrick et al., 2023). In the second analysis we estimated the heritability of biomass.

We employed different methods to estimate the heritability of position in the plant size hierarchy, thereby allowing cross validation and comparison (given that the assumptions of each method differ). First, we used Phenotype Correlation–Genotype Correlation (PCGC), which estimates genetic correlation from genome-wide association study (GWAS) summary statistics. PCGC models dichotomous traits on a continuous liability scale, which permits heritability estimation when the observed phenotype is binary, as in this case. Second, we used a likelihood-based version of the tetrachoric correlation method, TetraHer (Speed & Martin, 2024), which uses phenotypic correlation to examine the pairwise correlation between the latent, underlying continuous distribution of factors that determine plant size in pairs of individuals. PCGC might be more sensitive to population structure and relatedness, potentially inflating heritability estimates if these are not properly accounted for. On the other hand, tetrachoric regression, by focusing on identity by descent (IBD) segments, might better handle close relationships and relatedness, providing an alternative perspective on the genetic architecture. Pairwise relationship estimation (as required by TetraHer) was done using the KING (Kinship-based inference for GWAS) software (Manichaikul et al., 2010). We used the proportion of alleles that are identical by descent as our measure of kinship, as estimated from information on IBD segments. Both PCGC and TetraHer adjust for ascertainment, as present in case-control studies such as this one. Prevalence of the large class of plants was determined by the proportion of large plants observed in the random sample of 97 plants, as classified by K-means clustering (see above). We used the implementation of these methods in the LDAK 5.2 package.
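To show why both the population prevalence and the sampled proportion of large plants matter under ascertainment, the sketch below applies a commonly used transformation from observed-scale to liability-scale heritability for a case-control sample. It illustrates the general idea only and is not the correction that PCGC or TetraHer apply internally in LDAK; the function name and all numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist


def liability_scale_h2(h2_observed, prevalence, sample_case_fraction):
    """Convert observed-scale heritability of a binary trait to the
    liability scale, allowing for over- or under-sampling of cases.

    prevalence           -- K, fraction of 'cases' in the population
    sample_case_fraction -- P, fraction of 'cases' in the analysed sample
    """
    if not 0.0 < prevalence < 1.0 or not 0.0 < sample_case_fraction < 1.0:
        raise ValueError("prevalence and sample fraction must lie in (0, 1)")
    standard_normal = NormalDist()
    threshold = standard_normal.inv_cdf(1.0 - prevalence)
    z = standard_normal.pdf(threshold)       # normal density at threshold
    k, p = prevalence, sample_case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Hypothetical inputs: equal prevalence and sampling fraction of 0.5.
example_h2 = liability_scale_h2(0.2, prevalence=0.5, sample_case_fraction=0.5)
```

When the sample fraction equals the prevalence, the expression reduces to h² × K(1 − K) / z²; with K = 0.5 the scaling factor is π/2, so the example returns roughly 0.314.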

We also estimated the heritability of plant biomass as a quantitative trait, apart from information on the position in the size hierarchy. This was done in two ways. First, we used a linear model to estimate the genetic effects under restricted maximum likelihood (REML). The heritability of biomass was estimated assuming the GCTA model (Yang et al., 2011) as implemented in the LDAK v. 5.2 package. Second, we used a modification of the tetrachoric correlation approach (Speed & Martin, 2024). Unlike TetraHer, this approach, termed “QuantHer”, assumes the phenotype pair is drawn from a bivariate normal distribution.
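REML estimation of this kind takes a genomic relationship matrix as input. A minimal sketch of a GRM built from standardized genotypes with equal SNP weights, the form usually associated with the GCTA model, is shown below. It assumes complete 0/1/2 genotype data and is not LDAK's implementation; the function name and the toy genotypes are hypothetical.

```python
def genomic_relationship_matrix(genotypes):
    """GRM from allele counts; genotypes[i][j] is SNP i in individual j.

    Entry (j, k) is the average over SNPs of
    (x_ij - 2p_i)(x_ik - 2p_i) / (2 p_i (1 - p_i)),
    where p_i is the sample allele frequency at SNP i.
    """
    n = len(genotypes[0])
    grm = [[0.0] * n for _ in range(n)]
    snps_used = 0
    for snp in genotypes:
        p = sum(snp) / (2.0 * n)
        variance = 2.0 * p * (1.0 - p)
        if variance == 0.0:          # monomorphic SNPs carry no information
            continue
        centered = [g - 2.0 * p for g in snp]
        for j in range(n):
            for k in range(j, n):
                grm[j][k] += centered[j] * centered[k] / variance
        snps_used += 1
    if snps_used == 0:
        raise ValueError("no polymorphic SNPs available")
    for j in range(n):
        for k in range(j, n):
            grm[j][k] /= snps_used
            grm[k][j] = grm[j][k]    # fill the lower triangle by symmetry
    return grm


# Hypothetical data: two SNPs scored in three plants.
toy_grm = genomic_relationship_matrix([[0, 1, 2], [2, 1, 0]])
```

In the toy data the first and third plants carry opposite homozygous genotypes at both SNPs, so their off-diagonal relationship is strongly negative, while the heterozygous plant sits at the sample mean and has entries of zero.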

We note that sampling weights from the smallest and largest plants for REML and PCGC constitutes “extreme sampling”, but this has been shown previously to have a minimal biasing effect on heritability estimation (Golan et al., 2014). Nevertheless, we also conducted our own simulation analyses to gauge the possible effects of sampling protocol on our own results (see Supplementary Methods for details).


Anderson, J. T. (2016). Plant fitness in a rapidly changing world. New Phytologist, 210, 81-87.

Bérénos, C., Ellis, P. A., Pilkington, J. G., & Pemberton, J. M. (2014). Estimating quantitative genetic parameters in wild populations: A comparison of pedigree and genomic approaches. Molecular Ecology, 23(14), 3434-3451.

Bontemps, A., Lefèvre, F., Davi, H., & Oddou‐Muratorio, S. (2016). In situ marker‐based assessment of leaf trait evolutionary potential in a marginal European beech population. Journal of Evolutionary Biology, 29(3), 514-527.

Burt, A. (1995). The evolution of fitness. Evolution, 49(1), 1-8.

Castellanos, M. C., González‐Martínez, S. C., & Pausas, J. G. (2015). Field heritability of a plant adaptation to fire in heterogeneous landscapes. Molecular Ecology, 24(22), 5633-5642.

Chen, J., Glémin, S., & Lascoux, M. (2017). Genetic diversity and the efficacy of purifying selection across plant and animal species. Molecular Biology and Evolution, 34(6), 1417-1428.

Day, P. D., Pellicer, J., & Kynast, R. G. (2012). Orange balsam (Impatiens capensis Meerb., Balsaminaceae): A re-evaluation by chromosome number and genome size. The Journal of the Torrey Botanical Society, 139(1), 26-33.

DeRaad, D. A. (2022). SNPfiltR: An R package for interactive and reproducible SNP filtering. Molecular Ecology Resources, 22(6), 2443-2453.

Donohue, K. (2002). Germination timing influences natural selection on life‐history characters in Arabidopsis thaliana. Ecology, 83(4), 1006-1016.

Dorfman, R. (1979). A formula for the Gini coefficient. The Review of Economics and Statistics, 61, 146-149.

Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PloS One, 6(5), e19379.

Fisher, R. A. (1930). The genetical theory of natural selection. Clarendon Press, UK.

Gienapp, P., Fior, S., Guillaume, F., Lasky, J. R., Sork, V. L., & Csilléry, K. (2017). Genomic quantitative genetics to study evolution in the wild. Trends in Ecology & Evolution, 32(12), 897-908.

Golan, D., Lander, E. S., & Rosset, S. (2014). Measuring missing heritability: Inferring the contribution of common variants. Proceedings of the National Academy of Sciences, 111(49), E5272-E5281.

Gomulkiewicz, R., & Holt, R. D. (1995). When does evolution by natural selection prevent extinction? Evolution, 49, 201-207.

Harper, J. L. (1977). Population biology of plants. New York, NY: Academic Press.

Hendry, A. P., Schoen, D. J., Wolak, M. E., & Reid, J. M. (2018). The contemporary evolution of fitness. Annual Review of Ecology, Evolution, and Systematics, 49, 457-476.

Hill, W. G., & Robertson, A. (1968). Linkage disequilibrium in finite populations. Theoretical and Applied Genetics, 38, 226-231.

Hill, W. G., & Zhang, X. S. (2009). Maintaining genetic variation in fitness. In J. van der Werf, H. U. Graser, R. Frankham, & C. Gondro (Eds.), Adaptation and Fitness in Animal Populations (pp. 67-85). Dordrecht: Springer.

Hohenlohe, P. A., Bassham, S., Etter, P. D., Stiffler, N., Johnson, E. A., & Cresko, W. A. (2010). Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genetics, 6(2), e1000862.

Johnston, S. E., Chen, N., & Josephs, E. B. (2022). Taking quantitative genomics into the wild. Proceedings of the Royal Society B, 289, 20221930.

Knaus, B. J., & Grünwald, N. J. (2017). vcfR: A package to manipulate and visualize variant call format data in R. Molecular Ecology Resources, 17(1), 44-53.

Kruuk, L. E., Clutton-Brock, T. H., Slate, J., Pemberton, J. M., Brotherstone, S., & Guinness, F. E. (2000). Heritability of fitness in a wild mammal population. Proceedings of the National Academy of Sciences, 97(2), 698-703.

Kulbaba, M. W., Sheth, S. N., Pain, R. E., Eckhart, V. M., & Shaw, R. G. (2019). Additive genetic variance for lifetime fitness and the capacity for adaptation in an annual plant. Evolution, 73(9), 1746-1758.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754-1760.

Lin, S., Medina, C. A., Norberg, O. S., Combs, D., Wang, G., Shewmaker, G., & Yu, L. X. (2021). Genome-wide association studies identifying multiple loci associated with alfalfa forage quality. Frontiers in Plant Science, 12, 648192.

Mackay, T. F., & Lyman, R. F. (2005). Drosophila bristles and the nature of quantitative genetic variation. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1459), 1513-1527.

Merilä, J., & Sheldon, B. C. (1999). Genetic architecture of fitness and nonfitness traits: Empirical patterns and development of ideas. Heredity, 83, 103-109.

Mitchell-Olds, T. (1986). Quantitative genetics of survival and growth in Impatiens capensis. Evolution, 40(1), 107-116.

Mitchell-Olds, T., & Bergelson, J. (1990). Statistical genetics of an annual plant, Impatiens capensis. II. Genetic basis of quantitative variation. Genetics, 124(2), 407-415.

Montalvo, A. M., & Shaw, R. G. (1994). Quantitative genetics of sequential life‐history and juvenile traits in the partially selfing perennial, Aquilegia caerulea. Evolution, 48(3), 828-841.

Mousseau, T. A., & Roff, D. A. (1987). Natural selection and the heritability of fitness components. Heredity, 59(2), 181-197.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Putnam, N. H., O'Connell, B. L., Stites, J. C., Rice, B. J., Blanchette, M., Calef, R., ... & Green, R. E. (2016). Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Research, 26(3), 342-350.

Price, T., & Schluter, D. (1991). On the low heritability of life‐history traits. Evolution, 45(4), 853-861.

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Rochette, N. C., Rivera‐Colón, A. G., & Catchen, J. M. (2019). Stacks 2: Analytical methods for paired‐end sequencing improve RADseq‐based population genomics. Molecular Ecology, 28(21), 4737-4754.

Schmitt, J., Eccleston, J., & Ehrhardt, D. W. (1987). Dominance and suppression, size-dependent growth and self-thinning in a natural Impatiens capensis population. The Journal of Ecology, 75, 651-665.

Schwaegerle, K. E., & Levin, D. A. (1991). Quantitative genetics of fitness traits in a wild population of Phlox. Evolution, 45(1), 169-177.

Schwinning, S., & Weiner, J. (1998). Mechanisms determining the degree of size asymmetry in competition among plants. Oecologia, 113, 447-455.

Shaw, R. G., & Etterson, J. R. (2012). Rapid climate change and the rate of adaptation: insight from experimental quantitative genetics. New Phytologist, 195(4), 752-765.

Shaw, R. G., & Shaw, F. H. (2014). Quantitative genetic study of the adaptive process. Heredity, 112(1), 13-20.

Seppey, M., Manni, M., & Zdobnov, E. M. (2019). BUSCO: assessing genome assembly and annotation completeness. Gene prediction: methods and protocols, 227-245.

Silvertown, J., & Charlesworth, D. (2009). Introduction to plant population biology. John Wiley & Sons.

Speed, D., Holmes, J., & Balding, D. J. (2020). Evaluating and improving heritability models using summary statistics. Nature Genetics, 52(4), 458-462.

Stanton‐Geddes, J., Yoder, J. B., Briskine, R., Young, N. D., & Tiffin, P. (2013). Estimating heritability using genomic data. Methods in Ecology and Evolution, 4(12), 1151-1158.

Stevens, L., Goodnight, C. J., & Kalisz, S. (1995). Multilevel selection in natural populations of Impatiens capensis. The American Naturalist, 145(4), 513-526.

Thomas, S. C., & Bazzaz, F. A. (1993). The genetic component in plant size hierarchies: norms of reaction to density in a Polygonum species. Ecological Monographs, 63(3), 231-249.

Turner, M. D., & Rabinowitz, D. (1983). Factors affecting frequency distributions of plant mass: the absence of dominance and suppression in competing monocultures of Festuca paradoxa. Ecology, 64(3), 469-475.

Waller, D. M. (1985). The genesis of size hierarchies in seedling populations of Impatiens capensis Meerb. New Phytologist, 100(2), 243-260.

Weiner, J. (1985). Size hierarchies in experimental populations of annual plants. Ecology, 66(3), 743-752.

Weiner, J. A. (1988). The influence of competition on plant reproduction. In J. L. Doust & J. D. Doust (Eds.), Plant reproductive ecology: patterns and strategies (pp. 228-245). Oxford Univ. Press, NY.

Weiner, J. (1990). Asymmetric competition in plant populations. Trends in Ecology & Evolution, 5(11), 360-364.

Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.

Usage notes

LDAK v. 5.2 package of programs

STACKS version 2.60

bwakit v. 0.7.12

SAMtools v. 1.13

R version 4.1.1

vcfR v. 1.12 (an R package)

SNPfiltR v. 1.01 (an R package)

Plink v. 1.9


Natural Sciences and Engineering Research Council, Award: N/A

Danmarks Frie Forskningsfond, Award: 7025-00094B