Assembly and annotation of the saguaro cactus genome – SGP5p v2
Data files
Apr 24, 2023 version files 1.42 GB
- Cgig_SGP5p_v2_5942_220112.fa
- Cgig_SGP5p_v2_MAKER_34211.gff
- Cgig_SGP5p_v2_Rfam.gff
- Cgig_SGP5p_v2_tRNAScan.gff
- Cgig_SGP5p_v2.mod.EDTA.intactN.gff3
- Cgig_SGP5p_v2.mod.EDTA.TEannoN.gff3
- README.md
Abstract
We present an improved genome assembly of the saguaro cactus (Carnegiea gigantea (Engelm.) Britton & Rose), obtained by adding long-read PacBio data to the existing short reads. The new assembly improves on the previous one in total size, contiguity, and accuracy, allowing us to extend the range of sequence analyses beyond the single-gene scale. As a result, 20% more genes were identified, expanding the resources available for Cactaceae, a neglected yet very peculiar plant family.
Methods
Seeds from the SGP5 plant were germinated and grown in a greenhouse. A 5 cm seedling was selected for DNA extraction using a modified CTAB protocol, and the specimen was named SGP5p. A PacBio WGS library was constructed from SGP5p genomic DNA and sequenced on three 1M SMRT cells on a PacBio Sequel instrument.
The genome was assembled with MaSuRCA (v.3.3.9) using the configuration parameters NUM_THREADS = 40 and JF_SIZE = 14000000000. Input data were the Illumina data from Copetti et al., 2017 (SRR5036292, SRR5036295, and SRR5036296) and the newly generated PacBio data (SRR17493743, SRR17493742, and SRR17493741). For the coverage analysis, MaSuRCA error-corrected megareads were aligned to the assembly with minimap2 (v.2.17-r941), and the alignments were parsed with SAMtools (v.1.13, subcommands view, sort, index) and bedtools (v.2.30.0, subcommand coverage). Mean coverage in 10 kb regions was plotted.
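The windowed averaging behind the coverage plot can be illustrated with a small sketch. This is not the authors' script (the analysis above used bedtools coverage); it only shows the arithmetic of a mean depth per 10 kb window. The function name `windowed_mean_coverage` and the input layout (BED-like, 0-based half-open intervals with a depth value) are assumptions made for illustration.

```python
def windowed_mean_coverage(scaffold_length, intervals, window=10_000):
    """Return the mean depth of each consecutive window along one scaffold.

    intervals: iterable of (start, end, depth) tuples, 0-based half-open
    (assumed layout). Positions not covered by any interval count as depth 0.
    The last window is shortened to the end of the scaffold.
    """
    n_windows = (scaffold_length + window - 1) // window
    totals = [0.0] * n_windows
    for start, end, depth in intervals:
        end = min(end, scaffold_length)
        pos = start
        while pos < end:
            idx = pos // window
            # Split intervals that cross a window boundary.
            chunk_end = min(end, (idx + 1) * window)
            totals[idx] += depth * (chunk_end - pos)
            pos = chunk_end
    means = []
    for idx, total in enumerate(totals):
        win_len = min(window, scaffold_length - idx * window)
        means.append(total / win_len)
    return means
```

Dividing by the actual window length, rather than by the number of covered bases, keeps uncovered stretches visible as low-coverage windows.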
Chloroplast and mitochondrial genomes were downloaded from and NCBI accession FP885845, respectively, and used as a database to identify organellar sequences in the SGP5p v2 assembly via BLASTN (v.2.11.0+). One scaffold contained the whole chloroplast genome, and its sequence was edited to start at the same position as other chloroplast genomes. Assembly errors in this sequence were corrected by four rounds of subread alignment (with minimap2) and consensus calling with Racon (v.1.4.20). The resulting sequence was then incorporated into the nuclear genome assembly. BLASTN (-evalue 1e-5) alignment revealed a single scaffold with similarity to the mitochondrial genome.
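Setting a new start position on a circular genome amounts to rotating the sequence. The sketch below illustrates the idea only; how the authors edited the chloroplast scaffold is not described here. The function names and the use of an exact-match anchor string are assumptions, and the strand is not considered.

```python
def rotate_circular(seq, new_start):
    """Rotate a circular sequence so that 0-based index new_start becomes position 0."""
    if not seq:
        return seq
    new_start %= len(seq)
    return seq[new_start:] + seq[:new_start]


def rotate_to_anchor(seq, anchor):
    """Rotate seq so that it begins with the first exact occurrence of anchor.

    The search runs on the sequence extended by its own first bases, so an
    anchor that spans the original linearisation point is still found.
    """
    if not anchor:
        raise ValueError("anchor must not be empty")
    extended = seq + seq[: len(anchor) - 1]
    idx = extended.find(anchor)
    if idx == -1:
        raise ValueError("anchor not found on this strand")
    return rotate_circular(seq, idx)
```

In practice the anchor would be the first bases of a reference chloroplast genome, and a reverse-complement check would be needed if the scaffold was assembled on the opposite strand.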
Properties of the assembly were assessed with BUSCO (v.5.2.2, --augustus_species tomato -l embryophyta_odb10 -m genome -e 1e-10) and KAT (v2.4.2, subcommand comp -H 10000000000 -h -m 21). Assemblies were compared with QUAST (v.5.0.2, --large --conserved-genes-finding --no-snps --no-sv --no-read-stats).
RNA-Seq data from Copetti et al., 2017 (SRR5134694, SRR5134696, SRR5134695, and SRR5134693) were aligned to the assembly with HISAT2 (v2.2.1, --dta --max-intronlen 100000), then processed with SAMtools (subcommands view, sort, index), StringTie (v2.1.7, -m 150), and Gffread (v0.12.7) to obtain hints for gene prediction. Other sets of evidence were gene models, Lophophora williamsii unigenes, a plant protein library, and Augustus and SNAP ab initio predictions from Copetti et al., 2017. The assembly was masked for repeats with RepeatMasker (https://www.repeatmasker.org/, v. open-4.0.7, Search Engine: NCBI/RMBLAST [ 2.6.0+ ]) and the repeat library from Copetti et al., 2017. MAKER gene models that did not start with a methionine were removed, as were models whose BLASTP (-evalue 1e-5) alignments to TE genes had more than 40% identity, were at least 33 aa long, and spanned at least 40% of the query length.
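The two gene-model filters can be written down as a short sketch. It reflects one reading of the thresholds (identity above 40%, alignment length of at least 33 aa, and at least 40% of the query covered, all evaluated per alignment) and is not the authors' code. The function names are hypothetical, and the hit fields borrow the names of BLAST tabular output columns (pident, length, qstart, qend, qlen), assuming those columns were requested.

```python
def starts_with_methionine(protein):
    """True if the predicted protein begins with M."""
    return protein.upper().startswith("M")


def is_te_like(hit, min_identity=40.0, min_aln_len=33, min_query_cov=0.40):
    """Apply the three thresholds to one BLASTP alignment against TE proteins.

    hit: dict with keys pident, length, qstart, qend, qlen (assumed layout).
    """
    query_span = abs(hit["qend"] - hit["qstart"]) + 1
    return (
        hit["pident"] > min_identity
        and hit["length"] >= min_aln_len
        and query_span / hit["qlen"] >= min_query_cov
    )


def filter_models(proteins, te_hits):
    """Keep models that start with M and have no TE-like alignment.

    proteins: {model_id: protein sequence}
    te_hits:  {model_id: [hit, ...]} (models without hits may be absent)
    """
    kept = {}
    for model_id, seq in proteins.items():
        if not starts_with_methionine(seq):
            continue
        if any(is_te_like(h) for h in te_hits.get(model_id, [])):
            continue
        kept[model_id] = seq
    return kept
```

Whether query coverage should be computed per alignment or summed over all alignments of a model is a design choice the text does not settle; the sketch uses the simpler per-alignment version.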
A global TE annotation was performed with EDTA (v.2.0.1, --cds [CDS from gene annotation] --anno 1 --evaluate 1 --sensitive 1). Non-coding RNAs were predicted with Infernal (v.1.1.4, Rfam database v14.6, run as cmscan -Z 10000 --cut_ga --rfam --nohmmonly --fmt 2 --oskip) and tRNAscan-SE (v.2.0.9).
Usage notes
Files are in standard formats for genomic data: FASTA (.fa) for the assembly and GFF (.gff, .gff3) for the annotations.
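Because both formats are plain text, the files can be read without specialised software. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes the annotation files follow the usual nine-column, tab-separated GFF layout.

```python
from collections import Counter


def read_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the sequence name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_gff_features(handle):
    """Count feature types (column 3) in an iterable of GFF lines."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section ends the annotation part
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts
```

Both functions accept an open file, for example `open("Cgig_SGP5p_v2_5942_220112.fa")` for `read_fasta` or `open("Cgig_SGP5p_v2_MAKER_34211.gff")` for `count_gff_features`. For anything beyond quick inspection, an established parser is the safer choice.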