Skip to main content

Capturing single-copy nuclear genes, organellar genomes, and nuclear ribosomal DNA from deep genome skimming data for plant phylogenetics: A case study in Vitaceae

Cite this dataset

Liu, Bin-Bin et al. (2021). Capturing single-copy nuclear genes, organellar genomes, and nuclear ribosomal DNA from deep genome skimming data for plant phylogenetics: A case study in Vitaceae [Dataset]. Dryad.


With the decreasing cost and availability of many newly developed bioinformatics pipelines, next-generation sequencing (NGS) has revolutionized plant systematics in recent years. Genome skimming has been widely used to obtain high-copy fractions of the genomes, including plastomes, mitochondrial DNA (mtDNA), and nuclear ribosomal DNA (nrDNA). In this study, through simulations, we evaluated the optimal (minimum) sequencing depth and performance for recovering single-copy nuclear genes (SCNs) from genome skimming data, by subsampling genome resequencing data and generating 10 datasets with different sequencing coverage in silico. We tested the performance of four datasets (plastome, nrDNA, mtDNA, and SCNs) obtained from genome skimming based on phylogenetic analyses of the Vitis clade at the genus level and Vitaceae at the family level, respectively. Our results showed that optimal minimum sequencing depth for high-quality SCNs assembly via genome skimming was about 10× coverage. Without the steps of synthesizing baits and enrichment experiments, coupled with incredibly low sequencing costs, we showcase that deep genome skimming (DGS) is as effective for capturing large datasets of SCNs as the widely used Hyb-Seq approach, in addition to capturing plastomes, mtDNA, and entire nrDNA repeats. DGS may serve as an efficient and economical alternative and may be superior to the popular target enrichment/Hyb-Seq approach.


2.1 Sequencing depths for each case

For the genus-level case, we used the genome resequencing dataset of Vitis by Ma et al. (2018), representing 41 species sequenced using Illumina Hi-Seq (NCBI Short Read Archive SRP161488 under the BioProject PRJNA490319). Detailed species and voucher information can be found in Ma et al. (2018) and Table S1. The sequencing depths of these 41 datasets ranged from 17.5× to 33.4× coverage with approximately 20× coverage on average (Table S1), assuming an estimated genome size of around 487 Mb based on the Vitis vinifera L. genome (Jaillon et al., 2007).

For the family-level case, we used the low-coverage genome skimming dataset of 27 Vitaceae species generated on an Illumina Next-Seq instrument by Zhang et al. (2015) to test the effectiveness of simulation results in Vitaceae. All raw data were downloaded from the GenBank with the BioProject accession number PRJNA298058 and the sequencing depths ranged from 4× to 7.4× coverage (average 5.6× coverage). Detailed species and voucher information are available in Zhang et al. (2015) and Table S2.

2.2 Capturing single-copy nuclear genes in silico via genome skimming

2.2.1 Data subset creation

For the dataset of 41 genome resequencing samples of Vitis, we used the python script (Piro et al., 2017) to make random draws from the raw data files. Given the nearly even sequencing depths in these 41 genome resequencing samples and the ease of subsampling, nine different subset sequencing depths were generated from the raw data. These included 2× (10%), 4× (20%), 6× (30%), 8× (40%), 10× (50%), 12× (60%), 14× (70%), 16× (80%), and 18× (90%). This range was selected because we expected the minimum sequencing depth for success would fall within this range.

2.2.2 Single-copy nuclear marker development

Targeted nuclear genes were selected from the coding regions of Vitis vinifera (GenBank assembly accession: GCA_000003745.2). The coding sequences were first submitted to MarkerMiner v.1.0 (Chamala et al., 2015) to identify the putative single-copy genes. The genome of V. vinifera as a proteome reference has been integrated into MarkerMiner, and the default settings of the program were followed, except that the minimum sequence length was set as “600 bp” in order to acquire more candidate genes. The resulting genes were then filtered by successively BLASTing (Altschul et al., 1990, 1997; Camacho et al., 2009) them against four available Vitis genomes (V. aestivalis Michx., GCA_001562795.1; V. cinirea (Engelm.) Millardet × V. riparia Michx., GCA_001282645.1; V. riparia, GCA_004353265.1; and V. vinifera, GCA_000003745.2) in Geneious Prime (Kearse et al., 2012), with the parameter settings in the Megablast program (Morgulis et al., 2008) as a maximum of 60 hits, a maximum E-value of 1 × 10-10, a linear gap cost, a word size of 28, and scores of 1 for match and -2 for mismatch in alignments. We first excluded the genes with mean coverage > 1.1 for alignments, which generally would suggest potential paralogy of the genes and/or the presence of highly repeated elements in the sequences. The remaining alignments were further visually examined to exclude those genes receiving multiple hits with long overlapping but different sequences during BLASTing. It should be noted that the alignments with mean coverage between 1.0 and 1.1 were generally caused by the presence of tiny pieces of flanking intron sequences in the alignments. These fragments were still accepted as an SCN here. After the filtration, the remaining genes were used as references in the following gene assembly.

2.2.3 Targeting nuclear single-copy genes

We used Trimmomatic v. 0.39 (Bolger et al., 2014) for quality trimming and adapter clipping, by removing the leading/trailing low quality or below quality three bases, scanning the read with a 4-base wide sliding window, cutting when the average quality per base drops below 14, and dropping reads below 36 bases long. Subsequently, the results were quality-checked using FastQC v. 0.11.9 (Andrews, 2018). The HybPiper pipeline v. 1.3.1 (Johnson et al., 2016) was used for targeting SCNs with default settings; BWA v. 0.7.1 (Li & Durbin, 2009) to align and distribute reads to target genes; SPAdes v. 3.15.0 (Bankevich et al., 2012) with a coverage cutoff value of 5 to assemble reads to contigs; and Exonerate v. 2.2.0 (Slater & Birney, 2005) to align assembled contigs to target sequences and determine exon-intron boundaries. Python and R scripts included in the HybPiper pipeline (Johnson et al., 2016) were used to retrieve the recovered gene sequences, and to summarize and visualize the recovery efficiency. The final alignment of the SCNs from the 10 subsampling datasets are available from the Dryad Digital Repository: (Liu et al., 2021).

2.3 Assembly of plastomes and nrDNA repeats by a successive method

To obtain high-quality plastomes and nrDNA repeats, a two-step strategy was used for assembly. NOVOPlasty v. 4.3.1 (Dierckxsens et al., 2016) was applied first to assemble the plastomes with high-quality raw data and then we used the successive assembly approach by Zhang et al. (2015), combining the reference-based and the de novo assembly methods to assemble the remaining low-quality samples. With the de novo assembly and a seed-and-extend algorithm, NOVOPlasty was the least laborious approach and resulted in the most accurate plastomes; however, this program needs sufficient high-quality raw reads without gaps to cover the whole plastome. The whole plastomes assembled from NOVOPlasty then could be used as references for assembling the remaining samples. The successive method provided us with a good approach to obtain relatively accurate and nearly complete plastomes with or without gaps from lower-coverage raw data. Due to the sensitivity of Bowtie2 v. 2.4.2 (Langmead & Salzberg, 2012) to the reference, this successive method needs a closely related reference sequence with more time and RAM requirement. Because the nrDNA copies are arranged in tandem repeats, we tentatively treated the complete nrDNA as circular in order to use the plastome assembly software (NOVOPlasty) for the nrDNA assembly, and the steps we used were nearly the same as the assembly procedure of plastomes as described above. The detailed procedure has been described in several recent studies (Zhang et al., 2015; Liu et al., 2019, 2020a, 2020b; Wang et al., 2020). All the assembled plastomes and nrDNA repeats have been submitted to GenBank with the accession numbers listed in Table S1 & S2.

2.4 Assembly of mitochondrial genes

Given the highly variable structure and the recurrent rearrangements in plant mitogenomes, we extracted the genic portion of Vitis vinifera mitogenome for phylogenetic analyses by Geneious Prime (Kearse et al., 2012), in which 37 mitochondrial origin protein-coding genes (38 genes, as rps19 has two functional full gene copies) were included (Goremykin et al., 2009). It should be noted that the mitochondrial origin rRNA, tRNA, and hypothetical genes (Goremykin et al., 2009) were not included in the targeted gene list. All of these 38 genes in Vitis vinifera were used as the reference in HybPiper (Johnson et al., 2016) to capture the mtDNA sequences for the other species with the coverage cutoff 5 and the other parameters by default. The final alignments are available from the Dryad Digital Repository (Data S12&S13): (Liu et al., 2021).

2.5 Sequence annotation and alignment

The assembled plastomes from the low-coverage and high-coverage datasets were annotated using PGA (Qu et al., 2019) with a closely related plastome (MT267294: Nekemias grossedentata (Hand.-Mazz.) J.Wen & Z.L.Nie) downloaded from GenBank as the reference, and the results of automated annotation checked manually. The coding sequences of plastomes were translated into proteins to check the start and stop codons manually in Geneious Prime (Kearse et al., 2012). The custom annotations in the GenBank format were converted into the FASTA and five-column feature tables file required by NCBI submission using GB2sequin (Lehwark & Greiner, 2019).

The entire nrDNA sequence includes six regions, ETS, 18S, ITS1, 5.8S, ITS2, and 26S; the non-transcribed spacer region between the 26S and the ETS was excluded here due to the ambiguous alignment. All sequences from plastome, mtDNA, nrDNA repeats, and SCNs assembled here were aligned separately by MAFFT v. 7.475 (Nakamura et al., 2018) with default parameters. Specifically, as the sequences of the two IR regions of the plastome in each accession of the grape family were completely or nearly identical, only one copy of the inverted repeat (IR) region was included for the whole plastome phylogenetic analyses. To reduce the systematic errors produced by poor alignment, we used trimAL v. 1.2 (Capella-Gutiérrez et al., 2009) to trim the alignment of these sequences, in which all columns with gaps in more than 20% of the sequences or with a similarity score lower than 0.001 were removed. Each aligned sequence of SCNs, plastid coding sequences (CDS), nrDNA regions, and mtDNA genes was concatenated by AMAS v. 1.0 (Borowiec, 2016), respectively, and the resulting alignment summaries of each dataset have been used to estimate the partition of each gene.

2.6 Phylogenetic analyses

Bayesian inferences (BI) were run for the whole plastome of Vitis, plastid CDS of Vitaceae, mtDNA, and nrDNA dataset at the genus level of Vitis and the family level of Vitaceae separately. The best-fit partitioning schemes and/or nucleotide substitution models for each dataset were estimated using PartitionFinder2 (Stamatakis, 2006; Lanfear et al., 2016), under the corrected Akaike information criterion (AICc) and linked branch lengths, as well as with greedy (Lanfear et al., 2012) for the nrDNA dataset and rcluster (Lanfear et al., 2014) algorithm options for plastid CDS and mtDNA dataset. The partitioning schemes and evolutionary model for each subset were used for the downstream BI analyses. The BI tree was inferred with MrBayes 3.2.7 (Ronquist et al., 2012). The Markov chain Monte Carlo (MCMC) analyses were run for 100,000,000 generations, and sampled at every 2,000 generations with the first 25% discarded as burn-in. The remaining trees were used to build a 50% majority-rule consensus tree. The stationarity was considered to be reached when the average standard deviation of split frequencies remained below 0.01. The BI tree was visualized using Geneious Prime (Kearse et al., 2012).

For the SCNs, we inferred phylogenies using both concatenation and species tree methods. For the concatenation analysis, we used the aforementioned PartitionFinder2 (Stamatakis, 2006; Lanfear et al., 2016) to estimate the best partitioning schemes for each gene with the parameters same as above except for using the rcluster algorithm option, and the resulted schemes were then used to infer Maximum Likelihood (ML) trees with RAxML 8.2.12 (Stamatakis, 2014). For estimating the coalescent species tree, we searched for the best-scoring ML trees and performed 100 rapid bootstraps employing the option “-f a” in RAxML 8.2.12 (Stamatakis, 2014), using an independent GTRGAMMA model for each of the 887 SCNs. The gene trees were then used to infer a coalescent-based species tree with ASTRAL-III (Zhang et al., 2018), which infers a species tree from gene trees accounting for the incongruence produced by incomplete lineage sorting (ILS). Each gene tree was rooted and low support branches (Bootstrap support (BS) ≤ 10%) were contracted by Newick Utilities (Junier & Zdobnov, 2010), since collapsing gene tree nodes with low BS support will help to improve accuracy (Zhang et al., 2018).

Because the phylogeny inferred from 887 SCNs (see the results below) showed some incongruence with that in Ma et al. (2018) based on single nucleotide polymorphisms (SNPs), we employed phyparts v. 0.0.1 (Smith et al., 2015) to calculate the amount of conflict among the 887 SCN gene trees by comparing the nuclear gene trees against the ASTRAL species tree. We performed phylogenetic conflict analysis using phyparts with a BS threshold of 30% (i.e., gene-tree branches/nodes with less than 30% BS were considered uninformative), although BS with 70% in some studies has been used as the cutoff (Stull et al., 2020). The baseline for strong support has been rightfully challenged (Soltis & Soltis, 2003). Nevertheless, it is useful, albeit somewhat arbitrary, for filtering out poorly supported branches, thus alleviating noise in the results of the conflict analysis (Smith et al., 2015). Phyparts results were visualized with (by Matt Johnson, available from


National Natural Science Foundation of China, Award: 32000163

National Natural Science Foundation of China, Award: 31620103902

Smithsonian Institution, Award: DNA Barcode Network