Skip to main content
Dryad logo

Something old, something new: evolution of Colombian weedy rice (Oryza spp.) through de novo de-domestication, exotic gene flow, and hybridization


Hoyos, Veronica; Plaza, Guido; Caicedo, Ana L. (2020), Something old, something new: evolution of Colombian weedy rice (Oryza spp.) through de novo de-domestication, exotic gene flow, and hybridization, Dryad, Dataset,


Weedy rice (Oryza spp.) is a worldwide weed of domesticated rice (O. sativa), considered particularly problematic due to its strong competition with the crop, which leads to reduction of yields and harvest quality. Several studies have established multiple independent origins for weedy rice populations in the U.S. and various parts of Asia; however, the origins of weedy rice in South America have not been examined in a global context. We evaluated the genetic variation of weedy rice populations in Colombia, as well as the contributions of local wild Oryza species, local cultivated varieties, and exotic Oryza groups to the weed, using polymorphism generated by genotyping by sequencing (GBS). We found no evidence for genomic contributions from local wild Oryza species (O. glumaepatula, O. grandiglumis, O. latifolia and O. alta) to Colombian weedy rice. Instead, Colombian weedy rice has evolved from local indica cultivars, and has also likely been inadvertently imported as an exotic pest from the US. Additionally, weeds comprising de novo admixture between these distinct weedy populations now represent a large proportion of genomic backgrounds in Colombian weedy rice. Our results underscore the impressive ability of weedy rice to evolve through multiple evolutionary pathways, including in situ de-domestication, range expansion, and hybridization.


Plant material and DNA extraction

Weedy rice seeds were collected by sampling in the five principal rice production areas of the country in 2014-2015. Twenty-six accessions were collected in the Central zone, 34 in the Llanos plains, 28 in the Bajo Cauca river valley, 26 in the North coast and 26 in the Santanderes area, for a total of 140 Colombian weedy rice (CWR) accessions. These were additionally classified according to hull color (lemma and palea), pericarp color, grain size, and awn presence, abundance and length. Nine varieties of commercial rice were supplied by the National Federation of Rice Growers (Fedearroz), which included: four varieties that are currently cultivated and have been in the market for 10 to 17 years (F473 cultivated in the low Cauca valley area, F2000 in Santanderes and North coast zones, F174 in the Eastern plains and F60 in the Central zone), one variety with national relevance and cultivated for approximately 17 years (F50), and four varieties that have left the market but were important over a 25 year period (Cica 6, Cica 8 and Cica 9 and Orizica1). Additionally, 19 landraces (traditional crop with manual dry seeding system, present in the low Cauca valley area) that belong to the germplasm bank of the Fedearroz Monteria section were sampled. We also obtained seeds of 22 wild South America Oryza from the International Rice Research Institute (IRRI); these included seven O. glumaepatula, six O. grandiglumis, five O. alta, and four O. latifolia.

Staggered sowing of all selected materials (CWR, commercial rice and landraces) was carried out in the Weed Science greenhouse at Universidad Nacional de Colombia. In order to break dormancy, weeds were exposed to 50°C for 12 hours (materials that did not germinate under this treatment were kept at 50°C for 72 hours). South American wild rice materials were planted in the Biology greenhouse at the University of Massachusetts, following a dormancy breaking treatment at 50°C for 5 days. 200 mg of young leaf tissue were used for DNA extraction with the Qiagen DNeasy® Plant Mini Kit. The concentration and purity of the DNA in the samples was quantified with a Qubit 2.0 fluorometer; 30 μl of DNA with a concentration of 30-100 ng/μl for each sample was sent to the Cornell University Biotechnology Resource Center (BRC) for genotyping by sequencing (GBS).

GBS library preparation and sequence analysis

GBS was used to detect polymorphisms distributed throughout the genome among our samples (Elshire et al., 2011). Restriction digestions were carried out with the enzyme ApeKI, and the fragments were ligated with individual barcoded and common adapters. DNA fragments were pooled for further PCR amplification to enrich the libraries. Single end fragments of 100 base pairs (bp) were sequenced on an Illumina HiSeq 2500 platform. Raw GBS data has been deposited in the NCBI SRA (experiment SUB6244405). Initial data processing was performed at the Biotechnology Institute of Cornell University with the TASSEL-GBS v.3.0 pipeline (Glaubitz et al., 2014) and the MSU7 rice reference genome. The first filters removed SNPs with minimum minor allele frequency <0.01 or missing data per site >90%. Additional filters were applied to the initial VCF file with the NGSEP pipeline (Duitama et al., 2014) at the University of Massachusetts. Final SNPs were supported by at least five reads (for the analysis of Colombian material) or three reads (for the global analysis), and displayed heterozygosity levels of <50%. SNPs with more than 15% missing data and individuals with more than 70% missing data were removed. Only biallelic SNPs were retained, filtering any other type of variants.

Data generated for Colombian and South American samples was integrated with other published GBS databases that included 128 cultivars, 173 weedy rice samples and 53 samples of the wild ancestor of cultivated rice (O. rufipogon/O. nivara) from Asia (Huang et al., 2017; Vigueira et al., 2019), nine cultivars and 17 weedy rice samples from the United States (Burgos et al., 2014), and four outgroup samples (O. meridionalis and O. barthii) from these same datasets (NCBI Short Read Archive experiment SRX576894), for a total of 574 accessions .

Population analyses

Population structure was analyzed using STRUCTURE (version 2.3.3, Hubisz, Falush, Stephens, & Pritchard, 2009), on the  Massachusetts Green High Performance Computing Cluster ( Due to the limitation in the amount of input data that can be handled by the program (Falush, Stephens, & Pritchard, 2003; Pritchard, Stephens, & Donnelly, 2000), approximately 10,000 SNPs were randomly selected for each analysis with a roughly 15,000 base pairs (bp) spacing. Heterozygotes were recorded as "N" and all the simulations were run with data coded as haploid, because weedy and cultivated rice are highly self-pollinated. The program was run for an ancestral "admixture" model, with no correlated allele frequencies. Runs were carried out using K values between 1 and 15, with three replicates per K, a burn-in period of 100,000 and 500,000 subsequent iterations. The optimal number of genetic groupings was determined using ΔK (Evanno, Regnaut, & Goudet, 2005) according to the program Structure Harvester (Earl & vonHoldt, 2012). The program CLUMPP (Jakobsson & Rosenberg, 2007) was used to obtain a single Q matrix for each K. The final matrix for each K value was visualized with Distruct (Rosenberg, 2004). For comparison, we also analyzed our complete SNP dataset for worldwide samples with the Bayesian clustering analysis fastStructure (version 1.0, Raj, Stephens, & Pritchard, 2014), without recoding heterozygotes, with no prior grouping. FastStructure runs were conducted for K from 2 to 8.  The best K was determined through, and the POPHELPER R package was used to generate an image (Francis, 2017).

To investigate the genetic divergence among individuals for all SNPs, the program SmartPCA from the Eigensoft package (Patterson, Price, & Reich, 2006; Price et al., 2006) was used. Figures with eigenvalues as coordinates were generated by RStudio 1.0.143.

To infer the phylogenetic relationships among samples, RAxML (Randomized Axelerated Maximum Likelihood) version 8 (Stamatakis, 2014) was used. The RAxML HPC2 on XSEDE tool was selected in the CIPRES portal (, with GTRGAMMA model and 1000 bootstraps. Because SNP data only presents variable sites, ascertainment bias correction (ASC) was performed. The best phylogenetic tree result was plotted using iTol v4 (Letunic & Bork, 2016).

Genetic diversity for each population was measured by evaluating the expected heterozygosity calculated for all loci and paired FST was used to estimate the genetic differences among populations  with the software ARLEQUIN (ver, Excoffier & Lischer, 2010). Additionally, an AMOVA was performed to analyze variation among and within populations.


National Science Foundation, Award: IOS-1032023