Assembly and annotation of eleven Salix (shrub willow) genomes
Data files
Dec 06, 2022 version files 1.31 GB
- README.md (2.08 KB)
- Salix_Genomes.tar.gz (1.31 GB)
Feb 02, 2023 version files 1.31 GB
Abstract
The shrub willows (Salix section Vetrix) are an emerging bioenergy crop in North America and Eurasia. However, genomic resources for this section are still quite limited, with only a few reference genomes available despite the many species used in breeding programs. Here we present de novo assemblies and annotations of eleven shrub willow genomes from six species. Copy number variation of candidate sex determination genes was characterized within each genome and revealed remarkable differences in duplication and deletion of putative master regulator genes. We also analyzed copy number and expression of candidate genes involved in floral secondary metabolism and identified substantial variation across genotypes, which can be used for parental selection in breeding programs. Lastly, we report on a genotype that produces only female descendants and identify gene presence/absence variation in the mitochondrial genome that may be responsible for this unusual inheritance.
Methods
DNA Sequencing
Fresh young leaf tissue from all 11 Salix genotypes was collected and ground in liquid nitrogen. DNA extraction was performed using a modified CTAB-based protocol. For long-read sequencing, 1 μg of DNA was used as input to Oxford Nanopore’s genomic DNA by ligation sequencing kit (SQK-LSK109), and the resulting library was sequenced on an R9.4.1 flow cell. Short-read sequencing of the same samples was performed on the Illumina HiSeq X Ten platform.
RNA Sequencing
RNA was extracted from eight tissues (root, xylem, internode, node, young leaf, mature leaf, petiole, and young stem) for all 11 genotypes, as well as from fasciated shoot tissue of 04-BN-051. Strand-specific RNA-Seq libraries were prepared by BGI and sequenced on the DNB-Seq platform, which generated paired-end 150 bp reads. The same RNA preps from mature leaves and roots were also sequenced on the Oxford Nanopore MinION platform, with the exception of ‘Jorr’, which failed quality control. The SQK-PCB109 PCR-based cDNA library kit was used to generate sequencing libraries for leaf and root tissue for all 11 genotypes, and these libraries were sequenced on R9.4.1 flow cells.
Hi-C library preparation
Hi-C libraries were prepared with the Phase Genomics Proximo Plant Hi-C kit and sequenced on the Illumina NovaSeq 6000 instrument, which generated paired-end 150 bp reads. The sequencing data from each Hi-C library underwent quality control with the Phase Genomics hic_qc.py script (https://github.com/phasegenomics/hic_qc, accessed Nov. 15, 2021) to ensure that each library contained a sufficient number of informative Hi-C reads.
Genome Assembly
Assembly was performed with Oxford Nanopore reads using Flye 2.8.3. Illumina short reads were mapped to the assembled contigs with BWA-MEM, and Pilon and a custom Python script were used with the Illumina data to generate the corrected draft assembly. Assembled contigs were scaffolded using Hi-C reads with Falcon and Juicer Hi-C to generate phased genome assemblies. A BUSCO search of the Eudicot core genes was performed against each assembly to assess its quality and completeness. One assembly, 04-FF-016, produced two chimeric contigs, HiC_scaffold_5 and HiC_scaffold_6, each spanning the entire length of several chromosomes. BLASTN analysis was used to determine which chromosomes each region aligned to, and each chimeric contig was manually cut at the approximate site where mapping behavior became abnormal. A letter (e.g. a, b, c) was appended to the name of each resulting scaffold to denote its origin from the original chimeric scaffold.
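The cutting and renaming of a chimeric scaffold can be illustrated with a minimal Python sketch. This is not the procedure used for 04-FF-016, where the cuts were made manually; the function name `split_scaffold`, the 0-based cut coordinates, and the exact naming pattern (letter appended directly to the scaffold name) are assumptions made for illustration.

```python
from string import ascii_lowercase


def split_scaffold(name, sequence, cut_sites):
    """Cut `sequence` at the given 0-based positions and label the pieces.

    Pieces are named by appending a, b, c, ... to `name` (an assumed
    naming pattern; the dataset may format the suffix differently).
    """
    sites = sorted(set(cut_sites))
    if any(site <= 0 or site >= len(sequence) for site in sites):
        raise ValueError("cut sites must fall strictly inside the sequence")
    if len(sites) + 1 > len(ascii_lowercase):
        raise ValueError("too many pieces for single-letter suffixes")

    # Consecutive boundaries define the start and end of each piece.
    bounds = [0] + sites + [len(sequence)]
    pieces = {}
    for letter, start, end in zip(ascii_lowercase, bounds, bounds[1:]):
        pieces[f"{name}{letter}"] = sequence[start:end]
    return pieces


# Toy example: a 10 bp "scaffold" cut at positions 4 and 8.
parts = split_scaffold("HiC_scaffold_5", "AAAACCCCGG", [4, 8])
for part_name, part_seq in parts.items():
    print(part_name, part_seq)
```

In practice the cut positions would come from inspecting where the BLASTN alignments switch from one chromosome to another.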
Annotation
Genome annotation was performed with the LoReAn v2.5 pipeline, which used both Oxford Nanopore and Illumina RNA-Seq data along with protein models from the JGI Populus trichocarpa v4.1, Populus deltoides v2.1, and Populus nigra x maximowiczii v1.1 reference genome annotations obtained from Phytozome (https://phytozome-next.jgi.doe.gov, accessed March 21, 2022), followed by Augustus ab initio gene prediction. To identify homologous gene models, BLASTN analysis was performed for each annotated transcript in every genome against the S. purpurea v5.1 annotation on Phytozome (https://phytozome-next.jgi.doe.gov, accessed July 6, 2022). Functional prediction of mRNAs in each annotation was performed using InterProScan 5.52-86.0. The number of genes missing from each annotation was estimated by running a BLAST analysis of all S. purpurea v5.1 CDS sequences against all annotated genes for each genome and identifying those S. purpurea v5.1 genes without a match in that genome.
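The missing-gene estimate described above amounts to a set difference between the query CDS identifiers and the identifiers that received at least one BLAST hit. The sketch below assumes tab-delimited BLAST output with the query ID in the first column (as in the common tabular format); the function name and the gene identifiers in the example are hypothetical, and no score or identity threshold is applied because the methods do not state one.

```python
def missing_queries(query_ids, blast_tab_lines):
    """Return query IDs that have no hit in tabular BLAST output.

    `blast_tab_lines` is any iterable of tab-delimited lines whose first
    column is the query ID (an assumption about the output format).
    """
    hit_ids = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        hit_ids.add(line.split("\t")[0])
    return sorted(set(query_ids) - hit_ids)


# Toy example with made-up identifiers.
queries = ["geneA", "geneB", "geneC"]
hits = [
    "geneA\ttarget1\t98.5",
    "geneC\ttarget7\t91.2",
    "geneC\ttarget9\t88.0",
]
print(missing_queries(queries, hits))
```

A real analysis would likely filter hits by e-value or coverage before treating a query as matched.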
Data Availability
All raw sequencing data have been deposited at the NCBI SRA (https://www.ncbi.nlm.nih.gov/sra). Raw Illumina and nanopore DNA sequencing data can be accessed with the BioProject ID PRJNA827350. The raw Illumina RNA-Seq data can be accessed with the BioProject ID PRJNA827350. Nanopore RNA-Seq data can be accessed with the BioProject ID PRJNA888070.
Usage notes
Data include FASTA files of genome assemblies, GFF3 annotation files, and text files with genome annotation information. All files can in principle be opened in any text editor, but the FASTA files are large: they may be too large to open on machines with limited memory or may take considerable time to load, and are best viewed from the command line. Assemblies and annotations can be loaded into IGV to view them in browser form.
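Because the assemblies are large, one option is to stream a FASTA file line by line rather than load it whole. The following is a minimal sketch that reports the length of each sequence while holding only one line in memory at a time; the function name `fasta_lengths` is hypothetical and the toy records stand in for a real assembly file.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines.

    Only the first whitespace-delimited token of each header is kept as
    the name. Sequence data is counted, not stored.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy example; with a real file, pass an open file handle instead.
records = [">chr1 example", "ACGT", "AC", ">chr2", "GG"]
for seq_name, seq_len in fasta_lengths(records).items():
    print(seq_name, seq_len)
```

An open file handle can be passed directly, since iterating over it yields one line at a time.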