The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars
Data files
Dec 22, 2023 version files (21.48 GB):
- Accession_info.xlsx
- Arabica_sgC.TIP.BB.vcf.gz
- Arabica_sgE.TIP.BB.vcf.gz
- Coffea_syntenic_alignments.tar.gz
- README.md
Abstract
Coffea arabica, an allotetraploid hybrid of C. eugenioides and C. canephora, is the source of approximately 60% of coffee products worldwide, and its cultivated accessions have undergone several population bottlenecks. We present chromosome-level assemblies of a di-haploid C. arabica accession and modern representatives of its diploid progenitors, C. eugenioides and C. canephora. The three species exhibit largely conserved genome structures between diploid parents and descendant subgenomes, with no obvious global subgenome dominance. We find evidence for a founding polyploidy event 350,000-610,000 years ago, followed by several pre-domestication bottlenecks, resulting in narrow genetic variation. A split between wild accessions and cultivar progenitors occurred ∼30.5 kya, followed by a period of migration between the two populations. Analysis of modern varieties, including lines historically introgressed with C. canephora, highlights their breeding histories and loci that may contribute to pathogen resistance, laying the groundwork for future genomics-based breeding of C. arabica.
README: The genome and population genomics of allopolyploid Coffea arabica reveal the diversification history of modern coffee cultivars
https://doi.org/10.5061/dryad.qnk98sfpt
The dataset contains two items: (i) syntenic alignments between C. canephora, C. eugenioides and C. arabica assemblies, and (ii) the variant calls used in the population analyses in the paper.
Description of the data and file structure
i. Syntenic alignments were obtained with the CoGe SynMap tool using default settings. In the file names, the first two items give the CoGe IDs of the genomes being aligned: C. canephora - ID50947; C. eugenioides - ID51132; C. arabica subCC - ID65471; C. arabica subEE - ID65472; C. arabica - ID65463. The contents of the columns are described on row 3 of the syntenic alignment files, and on row 1 of the tandem duplicate files (which can be identified by the string .tandems. in their names).
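The file-naming convention above can be turned into a small lookup. The following is a minimal sketch, not part of the dataset: the function names are hypothetical, and because the README does not state the exact delimiter used in the file names, the sketch simply assumes that the first two runs of digits in a name are the two CoGe IDs.

```python
import re

# CoGe genome IDs as listed in this README.
COGE_GENOMES = {
    "50947": "C. canephora",
    "51132": "C. eugenioides",
    "65471": "C. arabica subCC",
    "65472": "C. arabica subEE",
    "65463": "C. arabica",
}


def genomes_in_filename(name):
    """Return the genome names for the first two numeric IDs in a file name.

    Assumption: the first two digit runs in the name are the CoGe IDs.
    """
    ids = re.findall(r"\d+", name)[:2]
    return [COGE_GENOMES.get(i, "unknown") for i in ids]


def is_tandem_file(name):
    """Tandem duplicate files carry '.tandems.' in their names."""
    return ".tandems." in name


def column_description_row(name):
    """1-based row holding the column descriptions (row 1 for tandem files, row 3 otherwise)."""
    return 1 if is_tandem_file(name) else 3
```

For a hypothetical file name such as `50947_51132.example.txt`, `genomes_in_filename` would return the C. canephora and C. eugenioides labels; the actual names inside Coffea_syntenic_alignments.tar.gz should be checked before relying on this pattern.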
ii. The variant calls are given in VCF-formatted files. Each subgenome has its own file: Arabica_sgC.TIP.BB contains the variant calls for subgenome CC of C. arabica, and Arabica_sgE.TIP.BB those for subgenome EE.
SNPs that were called as heterozygous in the di-haploid C. arabica accession Et39 have been removed, but otherwise no quality filtering has been applied to the variants in these files. See Supplementary Material, Section 6.2, for the specific filtering steps carried out in the publication.
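Because the files carry no quality filtering, a first look at the genotype calls per sample can be useful before applying the filters from the publication. The sketch below is one possible way to do this with the Python standard library only; the function names are hypothetical, and it relies on the VCF convention that GT, when present, is the first subfield of each sample column.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def classify(sample_field):
    """Classify one sample column by its GT subfield (assumed to come first)."""
    gt = sample_field.split(":", 1)[0]
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def genotype_counts(lines):
    """Tally genotype classes per sample from an iterable of VCF lines."""
    samples, counts = [], {}
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            counts = {s: Counter() for s in samples}
            continue
        for sample, field in zip(samples, fields[9:]):
            counts[sample][classify(field)] += 1
    return counts
```

A typical call would be `with open_vcf("Arabica_sgC.TIP.BB.vcf.gz") as fh: counts = genotype_counts(fh)`. The files are large, so the function streams line by line rather than loading everything into memory.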
The mapping between the sequencing IDs and accession names is given in the provided Excel sheet (Accession_info.xlsx). Seq.ID (column B) shows the accession ID used in sequencing. accession_name (column C) gives the name of the accession/cultivar. Species_name (column D) provides the species of the accession; three different Coffea species were analysed in this study. Variety (column E) gives information on the cultivation status; Introgressed identifies C. arabica x C. canephora hybrids. Columns F-J provide the place of origin of the accession (district/location, country, as well as GPS coordinates). The cells are left empty if the exact value (GPS coordinate or altitude) is not known. Columns K-M provide genome information: ploidy level, estimated genome size and genome structure. Columns N-R give additional information: donor institute, collection location, additional notes on the accession, and the original reference. If the exact collection location is not known, the cell is left empty; in those cases the material has been obtained from line(s) maintained by the donor institute (column N). Cells in columns Q-R are left empty if there is no (known) original publication associated with the accession.
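As an illustration of how the Seq.ID to accession_name mapping might be applied, the sketch below relabels the sample columns of a VCF header line. It is only an assumption-laden example: the function names are hypothetical, loading the Excel sheet into a list of row dictionaries is left to the reader's spreadsheet library of choice, and the sketch assumes that the sample columns in the VCF files match the Seq.ID values, which should be verified against the actual files.

```python
def build_id_map(rows, id_col="Seq.ID", name_col="accession_name"):
    """Map sequencing IDs to accession names, skipping rows with empty cells.

    `rows` is an iterable of dicts keyed by the sheet's column headers.
    """
    return {
        str(r[id_col]).strip(): str(r[name_col]).strip()
        for r in rows
        if r.get(id_col) and r.get(name_col)
    }


def rename_vcf_header(header_line, id_to_name):
    """Replace sample IDs in a #CHROM header line with accession names.

    IDs missing from the mapping are kept unchanged.
    """
    fields = header_line.rstrip("\n").split("\t")
    fixed, samples = fields[:9], fields[9:]  # first nine columns are fixed by the VCF format
    renamed = [id_to_name.get(s, s) for s in samples]
    return "\t".join(fixed + renamed) + "\n"
```

Keeping unmapped IDs unchanged, rather than raising an error, makes it easy to spot samples that are absent from the sheet in the relabelled header.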
Code/Software: For the syntenic alignments, the CoGe platform was used. For the variant calls, the Linux operating system and GATK were used to obtain the VCF files; subsequent analysis was carried out using R, Plink, vcftools and smc++.
Methods
For the syntenic alignments, the assemblies were aligned on the CoGe platform (https://genomevolution.org/coge/) using default settings.
For the resequencing of 38 wild and cultivated Coffea arabica, two wild C. eugenioides, and two cultivated and one wild C. canephora accessions, libraries were prepared using KAPA HyperPrep Kits (Roche) following the manufacturer's instructions, and paired-end sequenced (2 x 125 bp) on an Illumina HiSeq2500 instrument to ~40x coverage. Additionally, a Linnaean herbarium sample was sequenced to 46x coverage with Ion Torrent technology.
Following quality control with FastQC, Illumina short reads were trimmed using Trimmomatic v0.36 and mapped to the C. arabica reference assembly with BWA mem v0.7.16a-r1181. For the Linnaean sample, the reads were processed according to the protocols recommended for degraded DNA analysis in MapDamage v.2.0.8. The GATK (v3.8.0) pipeline was used for SNP calling. Duplicates were marked and removed using Picard v2.0.1, and genotype likelihoods were called into GVCF files using HaplotypeCaller (GATK). For the diploid progenitors, to allow interspecies comparisons, the mapping was done to each of the subgenomes separately, including chromosome zero (i.e., contigs not assembled into pseudomolecules) in both mappings. Joint calling was carried out using GenotypeGVCFs (GATK), and snpEff v4.3t was used to assess the impact of the SNPs. To remove regions with cross-species mappings, we removed the SNPs that were called as heterozygous when mapping the di-haploid ET39 sequencing data to the Arabica reference genome.
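The logic of the final filtering step can be sketched as follows. This is not the authors' script, only a minimal illustration under stated assumptions: the function names are hypothetical, and the sketch assumes the di-haploid accession is present as a sample column in the VCF being filtered, with its column name passed in by the caller. A di-haploid genome should carry no true heterozygous sites, so any heterozygous call there is treated as a sign of cross-subgenome mapping and the site is dropped.

```python
def is_het(sample_field):
    """True if the GT subfield (assumed first) has two or more distinct called alleles."""
    gt = sample_field.split(":", 1)[0].replace("|", "/")
    alleles = gt.split("/")
    return "." not in alleles and len(set(alleles)) > 1


def drop_sites_het_in_sample(lines, sample_id):
    """Yield VCF lines, omitting records where `sample_id` is heterozygous.

    Header lines are passed through unchanged. Raises ValueError if
    `sample_id` is not a column in the #CHROM header line.
    """
    idx = None
    for line in lines:
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            idx = line.rstrip("\n").split("\t").index(sample_id)
            yield line
        else:
            fields = line.rstrip("\n").split("\t")
            if not is_het(fields[idx]):
                yield line
```

Sites with a missing genotype in the chosen sample are kept here; whether to retain or drop them is a design choice that this sketch does not take from the publication.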