Skip to main content

Genomic structural variants constrain and facilitate adaptation in natural populations of Theobroma cacao, the Chocolate Tree

Cite this dataset

Hämälä, Tuomas et al. (2021). Genomic structural variants constrain and facilitate adaptation in natural populations of Theobroma cacao, the Chocolate Tree [Dataset]. Dryad.


Genomic structural variants (SVs) can play important roles in adaptation and speciation. Yet, the overall fitness effects of SVs are poorly understood, partly because accurate population-level identification of SVs requires multiple high-quality genome assemblies. Here, we use 31 chromosome-scale, haplotype-resolved genome assemblies of Theobroma cacao – an outcrossing, long-lived tree species that is the source of chocolate – to investigate the fitness consequences of SVs in natural populations. Among the 31 accessions, we find over 160 thousand SVs, which together cover eight times more of the genome than SNPs and short indels (125 Mb vs. 15 Mb). Our results indicate that a vast majority of these SVs are deleterious: they segregate at low frequencies and are depleted from functional regions of the genome. We show that SVs influence gene expression, which likely impairs gene function and contributes to the detrimental effects of SVs. We also provide empirical support for a theoretical prediction that SVs, particularly inversions, increase genetic load through the accumulation of deleterious nucleotide variants as a result of suppressed recombination.
Despite the overall detrimental effects, we identify individual SVs bearing signatures of local adaptation, several of which are associated with genes differentially expressed between populations. Genes involved in pathogen resistance are strongly enriched among these candidates, highlighting the contribution of SVs on this important local adaptation trait. Beyond revealing new empirical evidence for the evolutionary importance of SVs, these 31 de novo assemblies provide a valuable resource for genetic and breeding studies in T. cacao. 


Sample collection and DNA extraction

We selected 31 cacao accessions, representing four populations (Iquitos, n = 8; Nanay, n = 7; Marañon, n = 8; Guiana), for sequencing and assembly. Leaf samples (early stage E) were collected from trees at the Centro Agronómico Tropical de Investigació y Enseñanza (CATIE) field station, Costa Rica. The leaves were flash-frozen and ground under liquid nitrogen. DNA was extracted using a CTAB protocol (1), with the following modifications: 0.25 – 0.3 g tissue / 15 ml lysis buffer, which contained 7% PVP (2) and 4% CTAB; isopropanol precipitation was carried out at –20°C overnight; after two washes the pelleted DNA was transferred, using a sterile glass rod, to 400 μl TE; pellets were then o resuspend at 4°C overnight; once fully resuspended, samples were digested with 1 μl RNase A (Qiagen #19101, 30 min at 37°C), then 10 μl Proteinase K (Qiagen #19131, 2 hours at 50°C) before performing the organic extractions using 1 volume of phenol:chloroform:isoamyl alcohol (25:24:1), then 1 volume of chloroform:isoamyl alcohol (24:1); after the (ammonium acetate) precipitation step, the DNA pellets were resuspended in 100 – 150 μl TE at 4°C overnight.

If a cloudy precipitate was observed during the ammonium acetate precipitation step above, samples were taken through another round of clean-up (the two organic extractions followed by ammonium acetate precipitation), performed large scale (in 4.5 ml TE buffer). Typically, multiple preps were pooled to perform this additional clean-up. The precipitated DNA was spooled on a glass rod, rinsed in 70% ethanol, air-dried ~ 3 minutes, then allowed to resuspend in 200 μl of TE at 4°C overnight.

After performing Qubit 2.0 Fluorometer (Invitrogen) concentration estimates, the DNA samples were split into several 1.5 ml tubes (5 – 7 μg / tube) and taken through a high salt gDNA clean-up protocol (PacBio: designed to remove residual polysaccharide contamination. Final pellets were resuspended in 25 – 30 μl TE overnight at 4°C before pooling and taking final Qubit concentration and NanoDrop (ThermoFisher) spectrophotometer readings. 500 ng of each sample was run on a Bio-Rad CHEF pulsed-field gel (1% gel, 0.5 X TBE, 14 C, 6V/cm, 1-6 sec. switch times for 11 hours, followed by 4.5 V/cm, 50-90 sec. switch times for 11 hours) to verify that the mean fragment size was above 48 Kb.


Genome assembly and pseudomolecule construction

10x Genomics Chromium libraries were constructed and sequenced (Illumina NovaSeq 6000, 150 bp paired-end) at the UC Davis Genome Center to a ~150-fold coverage (420 M reads) per accession. The estimated mean molecule length of linked-reads ranged from 18 to 39 Kb per Chromium library, with at least 50% of the reads nonduplicated and phased. Haploid genome size estimates based on k-mer frequency distributions ranged from 393 to 430 Mb, consistent with cacao genome sizes estimated using flow cytometry. 

We assembled the linked-reads from each accession with Supernova v2.1.1, the 10x Genomics de novo assembler, and translated the resulting phased assembly graphs to produce two parallel pseudo-haplotype sequence representations of each genome. Despite sequencing ~150-fold coverage for each of the 31 accessions, the Supernova assembler can handle only 30 – 85-fold coverage for the k-mer estimated genome size. We, therefore, set Supernova to randomly sample slightly less than half of the reads (~65-fold raw read coverage) for each accession. The assembled sequence contains regions of gapless contiguity (contigs), as well as regions containing intervening gaps present in the linked-reads (scaffolds).

After producing the initial pseudo-haplotype assemblies, we excluded non-plant sequences that were identified using MegaBLAST queries against the NCBI nucleotide database. Sequences for which the best match (e-value < 1 × 10-10) were not classified as embryophyte (land plants) were treated as contaminants and discarded. We performed a second iteration of MegaBLAST searches against the NCBI RefSeq plant organelle database and removed assembly sequences with high similarity (> 80% identity, > 50% coverage) to organelle sequences. The remaining nuclear contigs and scaffolds were ordered and oriented into chromosome-scale pseudomolecules with RaGOO, using the cacao cultivar Matina 1-6 (v1.1) reference chromosomes. By using a closely related reference to construct pseudomolecules, RaGOO can facilitate the identification of large SVs that span multiple contigs. 

Usage notes

Metadata for raw data and assembly available in the accompanying Excel spreadsheet (Genome_Assembly_Summary.xlsx).


National Science Foundation, Award: Grant IOS-1546863

National Institute of Food and Agriculture, Award: Project PEN04569 and Accession Number 1003147

National Institute of Food and Agriculture, Award: Project PEN04569 and Accession Number 1003147