Assembly and annotation of eleven Salix (shrub willow) genomes

Hyden, Brennan¹; Feng, Kai²; Yates, Timothy²; Jawdy, Sara²; Cereghino, Chelsea²; Smart, Lawrence¹; Muchero, Wellington²

Research facility: Oak Ridge National Laboratory

Published Dec 06, 2022; Updated Apr 07, 2025 on Dryad. https://doi.org/10.5061/dryad.5hqbzkh9f

Data files

Dec 06, 2022 version files (1.31 GB):
README.md, 2.08 KB
Salix_Genomes.tar.gz, 1.31 GB

Feb 02, 2023 version files (1.31 GB):
README.md, 2.08 KB
Salix_Genomes.tar.gz, 1.31 GB

Apr 07, 2025 version files (1.31 GB)

Abstract

The shrub willows (Salix section Vetrix) are an emerging bioenergy crop in North America and Eurasia. However, genomics resources in this section are still quite limited, with only a few reference genomes available, despite many species in use in breeding programs. Here we present de novo assemblies and annotations of eleven shrub willow genomes from six species. Copy number variation of candidate sex determination genes within each genome was characterized and revealed remarkable differences in putative master regulator gene duplication and deletion. We also analyzed copy number and expression of candidate genes involved in floral secondary metabolism and identified substantial variation across genotypes, which can be used for parental selection in breeding programs. Lastly, we report on a genotype that produces only female descendants and identified gene presence/absence variation in the mitochondrial genome that may be responsible for this unusual inheritance.

This dataset contains full assemblies and annotations of eleven shrub willow (Salix) genomes. Assemblies were generated from Oxford Nanopore long-read and Illumina short-read data and scaffolded with Hi-C. Annotation was performed with the LoReAn software using Illumina short-read and Oxford Nanopore long-read RNA-Seq from eight tissues of each genotype. Raw sequencing reads are available on NCBI SRA.

Description of the Data and file structure

The prefixes for each of the eleven genomes are as follows:
94006, a Salix purpurea female
94001, a Salix purpurea male
P63, a Salix suchowensis male
P294, a Salix suchowensis female
P295, a Salix suchowensis female
P336, a Salix integra female
SH3, a Salix koriyanagi female
04-FF-016, a Salix koriyanagi male
07-MBG-5027, a Salix viminalis female
Jorr, a Salix viminalis male
04-BN-051, a Salix udensis male

Each genome contains five files, as described below:
*_hap.FINAL.fasta ; This is the assembled genome in fasta format
*.gff3 ; This is the annotation file describing the location and relationship of each feature
*_mRNA_info.txt ; This file contains additional information about each transcript. Column one is the mRNA ID as given in the gff3 file, column two is the PANTHER ID for the transcript, column three is the PANTHER ID description, column four is the closest BLAST hit in the JGI v5.1 S. purpurea genome, and column five is the description of the closest BLAST hit.
*_transcripts.fa ; This file is the nucleotide fasta sequence of each transcript, as described in the gff file
*_protein.fa ; This file is the protein fasta sequence of each peptide, as described in the gff file.
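
The five-column layout of the *_mRNA_info.txt files described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset: it assumes the files are tab-delimited and have no header row, neither of which is stated in the README, and the names `MrnaInfo` and `parse_mrna_info` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class MrnaInfo(NamedTuple):
    """One row of a *_mRNA_info.txt file, in the column order given in the README."""
    mrna_id: str                 # column 1: mRNA ID as used in the gff3 file
    panther_id: str              # column 2: PANTHER ID
    panther_description: str     # column 3: PANTHER ID description
    blast_hit: str               # column 4: closest BLAST hit in S. purpurea v5.1
    blast_hit_description: str   # column 5: description of that hit


def parse_mrna_info(lines: Iterable[str], delimiter: str = "\t") -> Iterator[MrnaInfo]:
    """Yield one MrnaInfo per non-empty line.

    ASSUMPTION: tab-delimited text without a header row. Pass another
    delimiter if the files turn out to be formatted differently.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        # Pad short rows so missing trailing annotations become empty strings,
        # and ignore any unexpected extra columns.
        padded = (fields + [""] * 5)[:5]
        yield MrnaInfo(*padded)
```

The function accepts any iterable of lines, so an open file handle for one of the *_mRNA_info.txt files can be passed directly, for example to collect all transcripts that share a PANTHER ID.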

All files can be viewed in any standard text editor (e.g. Atom, Notepad, TextEdit, Sublime Text). IGV can be used to view the genome and annotation in browser form with the assembly and gff3 files.

Sharing/access Information

Raw DNA and RNA sequence data used to generate these assemblies and annotations are available on the NCBI Sequence Read Archive.

April 2025 version changes: "Salix_genomes_contigs2chrs.csv" was added, which describes which chromosome number each scaffold maps to (based on the standard Populus and Salix chromosome numbering system).
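
A scaffold-to-chromosome table of this kind can be loaded into a lookup dictionary as sketched below. The column layout of the CSV is not documented here, so the default column names `scaffold` and `chromosome` are placeholders and should be replaced with whatever header the file actually uses. The function name `load_scaffold_map` is likewise hypothetical.

```python
import csv
from typing import Dict, Iterable


def load_scaffold_map(
    lines: Iterable[str],
    scaffold_col: str = "scaffold",    # ASSUMPTION: placeholder column name
    chrom_col: str = "chromosome",     # ASSUMPTION: placeholder column name
) -> Dict[str, str]:
    """Return {scaffold name: chromosome label} from CSV text with a header row."""
    reader = csv.DictReader(lines)
    mapping: Dict[str, str] = {}
    for row in reader:
        mapping[row[scaffold_col].strip()] = row[chrom_col].strip()
    return mapping
```

If the file covers several genomes in one table, the scaffold names alone may not be unique, in which case the key would need to include the genome prefix as well.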

DNA Sequencing

Fresh young leaf tissue for all 11 Salix genotypes was collected and ground in liquid nitrogen. DNA extraction was performed using a modified CTAB-based protocol. For long read sequencing, 1 μg of DNA was used as input to Oxford Nanopore’s genomic DNA by ligation sequencing kit (SQK-LSK109) and the subsequent library was sequenced on an R9.4.1 flow cell. Short read sequencing of the same samples was performed on the Illumina HiSeq X Ten platform.

RNA Sequencing

RNA was extracted from eight tissues (root, xylem, internode, node, young leaf, mature leaf, petiole, and young stem) for all 11 genotypes, as well as fasciated shoot tissue from 04-BN-051. Strand-specific RNA-Seq libraries were prepared by BGI and sequenced on the DNB-Seq platform, which generated paired-end 150 bp reads. The same RNA preps from mature leaves and roots were also sequenced on the Oxford Nanopore MinION platform, with the exception of ‘Jorr’, which failed quality control. The SQK-PCB109 PCR-based cDNA library kit was used to generate sequencing libraries for leaf and root tissue for all 11 genotypes, and these libraries were sequenced on R9.4.1 flow cells.

Hi-C library preparation

Hi-C libraries were prepared with the Phase Genomics Proximo Plant Hi-C kit. Hi-C libraries were sequenced on the Illumina NovaSeq 6000 instrument, which generated paired-end 150 bp reads. The sequencing data of each Hi-C library underwent quality control with the Phase Genomics hic_qc.py script (https://github.com/phasegenomics/hic_qc, accessed Nov. 15, 2021) to ensure that a sufficient number of informative Hi-C reads was present in each library.

Genome Assembly

Assembly was performed with Oxford Nanopore reads using Flye 2.8.3. Illumina short reads were mapped to the assembled contigs with BWA-MEM. Pilon and a custom Python script were used to generate the corrected draft assembly with the Illumina data. Assembled contigs were scaffolded using Hi-C reads with Falcon and Juicer Hi-C to generate phased genome assemblies. A BUSCO search of the Eudicot core genes was performed against each assembly to assess the quality and completeness of each genome. One assembly, 04-FF-016, produced two chimeric contigs, HiC_scaffold_5 and HiC_scaffold_6, each spanning the entire length of several chromosomes. BLASTN analysis was used to determine alignment to specific chromosomes, and each chimeric contig was manually cut at the approximate site where mapping behavior became abnormal. Resulting scaffolds were appended with a letter (e.g. a, b, c) to denote their origin from the original chimeric scaffold.
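
The cutting and renaming of the chimeric scaffolds can be illustrated with a short sketch. This is not the authors' procedure, which was manual and guided by BLASTN alignments. The cut coordinates are not given in this description, so they are an input here, and the exact form of the resulting names (letter appended directly to the scaffold name) is an assumption. The function name `split_scaffold` is hypothetical.

```python
import string
from typing import Dict, List


def split_scaffold(name: str, sequence: str, cut_sites: List[int]) -> Dict[str, str]:
    """Cut `sequence` at 0-based positions and label the pieces a, b, c, ...

    Each cut site is the index of the first base of the next piece.
    ASSUMPTION: pieces are named by appending the letter directly to the
    original scaffold name, e.g. "HiC_scaffold_5a".
    """
    sites = sorted(set(cut_sites))
    if any(site <= 0 or site >= len(sequence) for site in sites):
        raise ValueError("cut sites must fall strictly inside the sequence")
    if len(sites) + 1 > len(string.ascii_lowercase):
        raise ValueError("too many pieces for single-letter suffixes")

    bounds = [0] + sites + [len(sequence)]
    pieces: Dict[str, str] = {}
    # Pair each letter with consecutive (start, end) boundaries.
    for letter, start, end in zip(string.ascii_lowercase, bounds, bounds[1:]):
        pieces[f"{name}{letter}"] = sequence[start:end]
    return pieces
```

Because the pieces are contiguous slices, concatenating them in letter order reproduces the original scaffold, which gives a simple sanity check after cutting.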

Annotation

Genome annotation was performed with the LoReAn v2.5 pipeline, which utilized both Oxford Nanopore and Illumina RNA-Seq, along with protein models from the JGI Populus trichocarpa v4.1, Populus deltoides v2.1, and Populus nigra x maximowiczii v1.1 reference genome annotations obtained from Phytozome (https://phytozome-next.jgi.doe.gov, accessed March 21, 2022), followed by Augustus ab initio gene prediction. BLASTN analysis was performed for each annotated transcript for every genome against the S. purpurea v5.1 annotation on Phytozome (https://phytozome-next.jgi.doe.gov, accessed July 6, 2022) to identify homologous gene models. Functional prediction of mRNAs in each annotation was performed using InterProScan 5.52-86.0. The estimated number of missing genes from each annotation was determined by performing a BLAST analysis of all S. purpurea v5.1 CDS sequences against all annotated genes for each genome and identifying those S. purpurea v5.1 genes without a match in each genome.
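
The missing-gene estimate amounts to a set difference: all S. purpurea v5.1 CDS identifiers minus those that produced a BLAST match. The sketch below shows that logic under two assumptions that are not stated in the methods: the BLAST results are in the standard 12-column tabular format (query ID in column 1, e-value in column 11), and a hit counts as a match if its e-value is at or below a cutoff. The default cutoff of 1e-5 and the name `missing_genes` are illustrative only.

```python
from typing import Iterable, Set


def missing_genes(
    reference_ids: Iterable[str],
    blast_tab_lines: Iterable[str],
    max_evalue: float = 1e-5,   # ASSUMPTION: illustrative cutoff, not from the methods
) -> Set[str]:
    """Return reference IDs that have no BLAST hit passing the e-value cutoff.

    ASSUMPTION: `blast_tab_lines` is BLAST tabular output with the default
    12 columns, where column 1 is the query ID and column 11 is the e-value.
    """
    matched: Set[str] = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            matched.add(fields[0])
    return set(reference_ids) - matched
```

Running this once per genome, with the S. purpurea CDS IDs as `reference_ids`, would give a per-genome list whose length corresponds to the estimated number of missing genes.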

Data Availability

All raw sequencing data have been deposited at the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra). Raw Illumina and Nanopore DNA sequencing data can be accessed with the BioProject ID PRJNA827350. The raw Illumina RNA-Seq data can be accessed with the BioProject ID PRJNA827350. Nanopore RNA-Seq data can be accessed with the BioProject ID PRJNA888070.
