
Genomic characterization of a wild-like tomato accession from Arizona

Cite this dataset

Barnett, Jacob et al. (2022). Genomic characterization of a wild-like tomato accession from Arizona [Dataset]. Dryad.


Tomato domestication history has been revealed to be a highly complex story. A major contributor to this complexity is an evolutionary intermediate group (Solanum lycopersicum var. cerasiforme (Alef.) Voss; SLC) between the cultivated tomato (Solanum lycopersicum var. lycopersicum L.; SLL) and its wild relative (Solanum pimpinellifolium L.; SP). SLC includes accessions with a broad spectrum of genomic and phenotypic characteristics. Some SLC accessions were previously hypothesized to have spread northward from South America into Mesoamerica, a migration that probably entailed reversion to wild-like phenotypes such as smaller fruits. Prior to this study, the northernmost confirmed extension of SLC was limited to northern Mexico. In this study, we employed genomic methods to investigate the origin of a wild-like tomato found in a garden in Scottsdale, Arizona, USA. The so-called “Arizona tomato” featured a vigorous growth habit and carried small fruits weighing 2-3 grams. Our phylogenomic analyses identified the Arizona tomato as a member of the Mexican SLC population (SLC MEX). To our knowledge, this is the first report of an SLC accession, confirmed using genomics, growing spontaneously in Arizona. This finding could have implications for conservation biology as well as agriculture.


Plant collection and voucher specimens: The "Arizona tomato" was discovered growing naturally in a suburban yard in Scottsdale, Arizona, USA in January 2021 by April Kuipers. All plant materials used in this study were obtained with permission from Ms. Kuipers, who found the plant on her personal property. She collected seeds from the fruits and sent them to the University of Massachusetts Amherst, where plants were grown in a greenhouse kept at 22 ℃ (with a few degrees of day/night fluctuation) under ambient daylight conditions from February to August 2021. Seeds were soaked in a 2% sodium hypochlorite bleach solution, rinsed with distilled water, then placed ~0.5 cm deep in ProMix BX potting medium, resulting in a 100% germination rate (nine out of nine seeds). Two plants were grown to maturity, and two specimens from each plant were preserved on June 30, 2021, and deposited in the University of Massachusetts Herbarium (MASS) as vouchers.

Genome sequencing: Genomic DNA was extracted from a plant grown at Cornell University. Approximately 4 grams of leaf tissue from the youngest leaves were used. Leaf samples were frozen and ground in liquid nitrogen. DNA was purified using the NucleoBond® HMW DNA kit (from TaKaRa), following the purification steps exactly as described in the protocol accompanying the kit. The DNA concentration obtained from the leaf samples was 500 ng per microliter, of which 600 ng was used as input for DNA sequencing library preparation. The sequencing library was prepared using the New England Biolabs FS DNA Library Prep Kit (E7805, E6177). A paired-end library was made with fragment sizes of 100-300 bp. The resulting library was visualized on a 0.4% agarose gel to confirm the expected spectrum of fragment sizes. The library was sequenced with Illumina NextSeq technology using the paired-end mode (2 × 150 bp) at the Genomics Facility of the Cornell Institute of Biotechnology.

Sequence read alignment and variant calling: The reads obtained through genome sequencing were filtered by removing low-quality reads and the universal Illumina adaptors using cutadapt (v. 2.1; Martin 2011) with the default options. The remaining reads were checked for quality with FastQC (v. 0.11.8) and aligned to the most recent version of the cultivated tomato reference genome (SL4.0) using mem from bwa (v. 0.7.17; Li 2013). Genome alignments of SP, SLC, and SLL accessions created in previous studies (Razifard et al. 2020 and Pereira et al. 2021) were also included to create a variants dataset. A combined dataset of 21,358,376 variants including SNPs and indels (insertions and deletions < 10 bp) was made using mpileup from the bcftools software package (v. 1.14; Danecek et al. 2021). From this dataset, rare alleles (minor allele frequency < 0.02) were removed using vcftools (v. 0.1.17; Danecek et al. 2011).
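The rare-allele filter described above can be illustrated with a minimal sketch. This is not the vcftools implementation; it only shows the arithmetic of a minor allele frequency threshold. The genotype coding (0, 1, or 2 alternate alleles per diploid accession, None for a missing call), the function names, and the toy variants are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed coding: number of alternate alleles per diploid accession
# (0, 1, or 2); None marks a missing genotype call.
Genotypes = List[Optional[int]]


def minor_allele_frequency(genotypes: Genotypes) -> Optional[float]:
    """Return the frequency of the less common allele, or None if no calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_rare(variants: Dict[str, Genotypes],
                min_maf: float = 0.02) -> Dict[str, Genotypes]:
    """Keep variants whose minor allele frequency is at least min_maf."""
    kept = {}
    for name, genotypes in variants.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and maf >= min_maf:
            kept[name] = genotypes
    return kept


# Hypothetical toy variants, not taken from the dataset.
toy_variants = {
    "common": [0, 1, 2, 1],       # alternate allele frequency 4/8 = 0.5
    "rare": [0] * 99 + [1],       # alternate allele frequency 1/200 = 0.005
    "uncalled": [None, None],     # no usable genotype calls
}
kept_variants = filter_rare(toy_variants)
```

In this toy input only the "common" variant passes the 0.02 threshold; the "rare" variant falls below it and the "uncalled" variant has no frequency to evaluate.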

Phylogenetic reconstruction: A SNP phylogeny was built using only four-fold degenerate SNPs (“4D SNPs”), i.e. SNPs at third codon positions where substitution by any of the four bases leaves the translated amino acid unchanged. Such codon positions are considered to be under less selective pressure and are thus more suitable for studying phylogenetic relationships. The input dataset for the phylogenetic analyses was made as follows: A custom script was developed in R (v. 4.0.5) to extract 4D SNPs from the variants dataset described above. From the resulting 4D SNPs dataset, we kept only those with no missing data and minor allele frequency > 0.012 (to keep 4D SNPs with alternate alleles in at least two homozygous accessions, four heterozygous accessions, or some other genotype combination). These steps resulted in 41,787 4D SNPs. To infer a phylogenetic tree from the 4D SNPs, we used the coalescent method of SVDQuartets (Chifman and Kubatko 2014) included in PAUP (v. 4a168; Swofford 2003). The “exhaustive” search option, i.e. including all possible quartets, was chosen for the SVDQuartets analysis.
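To make the definition of a four-fold degenerate site concrete, the following sketch checks whether the third position of a codon is four-fold degenerate under the standard genetic code. It is not the authors' R script; the function and variable names are hypothetical, and the sketch assumes codons are given on the coding strand.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first base varies slowest, third base fastest); "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon: str) -> bool:
    """True if every base at the third position encodes the same amino acid."""
    prefix = codon[:2].upper()
    encoded = {CODON_TABLE[prefix + base] for base in BASES}
    return len(encoded) == 1 and "*" not in encoded


# Two-base prefixes whose third position is four-fold degenerate.
FOURFOLD_PREFIXES = sorted(
    {codon[:2] for codon in CODON_TABLE if is_fourfold_degenerate(codon)}
)
```

A SNP would qualify as a 4D SNP in this sketch when it falls on the third base of a codon for which `is_fourfold_degenerate` returns True, for example the glycine codons beginning with GG.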

As an alternative method, we also generated a phylogeny based on a genome-wide dissimilarity matrix calculated from 13,956,415 variants (all variants, excluding those with minor allele frequency < 0.012 or with more than 10% missing data) in all accessions using SNPRelate (v. 1.10.2; Zheng et al. 2012) in R. We then used the resulting distance matrix to create the phylogeny, based on the default “complete linkage” method.
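The distance-based approach can be sketched as follows. This is a simplified illustration, not the SNPRelate implementation: the dissimilarity here is assumed to be the mean absolute difference in alternate allele counts, scaled to the range 0 to 1, and the accession names and genotypes are hypothetical. The clustering step uses complete linkage, in which the distance between two clusters is the largest pairwise distance between their members.

```python
from itertools import combinations


def dissimilarity(a, b):
    """Mean absolute difference in alternate allele counts, scaled to [0, 1].

    Assumed proxy for a genome-wide dissimilarity; positions with a missing
    call (None) in either accession are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no shared genotype calls")
    return sum(abs(x - y) for x, y in pairs) / (2 * len(pairs))


def complete_linkage(names, dist):
    """Agglomerative clustering; returns merges as (members_a, members_b, height)."""
    clusters = [frozenset([name]) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Complete linkage: the farthest pair of members defines the distance.
        return max(dist[frozenset((i, j))] for i in c1 for j in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((sorted(c1), sorted(c2), cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Hypothetical genotypes (alternate allele counts) for four toy accessions.
genotypes = {
    "SP_1": [0, 0, 0, 0],
    "SLC_1": [2, 2, 0, 0],
    "SLC_2": [2, 2, 1, 0],
    "SLL_1": [2, 2, 2, 2],
}
names = list(genotypes)
dist = {
    frozenset((i, j)): dissimilarity(genotypes[i], genotypes[j])
    for i, j in combinations(names, 2)
}
merges = complete_linkage(names, dist)
```

With these toy genotypes the two SLC accessions join first, the SLL accession joins them next, and the SP accession is attached last, which is the kind of nested grouping a dendrogram built from the merge history would display.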


National Science Foundation, Award: 1564366

National Science Foundation of Sri Lanka, Award: 1942437