A high-quality genome assembly and annotation of the gray mangrove, Avicennia marina

Friis, Guillermo 1 ; Vizueta, Joel2 ; Nelson, David R.1 ; Khraiwesh, Basel1; Qudeimat, Enas1 ; Salehi-Ashtiani, Kourosh1 ; Ortega, Alejandra 3 ; Marshell, Alyssa4 ; Duarte, Carlos M.3 ; Smith, Edward1

Published May 27, 2020; Updated Feb 17, 2022 on Dryad. https://doi.org/10.5061/dryad.3j9kd51f5

Data files

May 27, 2020 version files 524.61 MB

Amar_GenomeAnnotation.tar.gz

524.61 MB

Jul 09, 2020 version files 514.16 MB

Amarina_GenomeDatasets.tar.gz

514.16 MB

Oct 19, 2020 version files 442.19 MB

Amarina_GenomeDatasets.rar

442.19 MB

Feb 17, 2022 version files 492.15 MB

Amar_GenomeAnnotation.tar.gz

492.15 MB

Abstract

The gray mangrove [Avicennia marina (Forsk.) Vierh.] is the most widely distributed mangrove species, ranging throughout the Indo-West Pacific. It presents remarkable levels of geographic variation both in phenotypic traits and habitat, often occupying extreme environments at the edges of its distribution. However, subspecific evolutionary relationships and adaptive mechanisms remain understudied, especially across populations of the West Indian Ocean. High-quality genomic resources accounting for such variability are also sparse. Here we report the first chromosome-level assembly of the genome of A. marina. We used a previously release draft assembly and proximity ligation libraries Chicago and Dovetail HiC for scaffolding, producing a 456,526,188 bp long genome. The largest 32 scaffolds (22.4 Mb to 10.5 Mb) accounted for 98 % of the genome assembly, with the remaining 2% distributed among much shorter 3,759 scaffolds (62.4 Kb to 1 Kb). We annotated 45,032 protein-coding genes using tissue-specific RNA-seq data in combination with de novo gene prediction, from which 34,442 were associated to GO terms. Genome assembly and annotated set of genes yield a 96.7% and 95.1% completeness score, respectively, when compared with the eudicots BUSCO dataset. Furthermore, an F_ST survey based on resequencing data successfully identified a set of candidate genes potentially involved in local adaptation, and revealed patterns of adaptive variability correlating with a temperature gradient in Arabian mangrove populations. Our A. marina genomic assembly provides a highly valuable resource for genome evolution analysis, as well as for identifying functional genes involved in adaptive processes and speciation.

Genome sequencing and assembly
The sequenced sample was leaf tissue obtained from an individual located at Ras Ghurab Island in the Arabian Gulf (Abu Dhabi, United Arab Emirates; 24.601°N, 4.566 °E), corresponding to the A. m. marina variety. A high-quality genome was produced using proximity ligation libraries and the software pipeline HiRiseat Dovetail Genomics, LLC. Briefly, for Chicagoand the Dovetail HiClibrary preparation, chromatin was fixed with formaldehyde. Fixed chromatin was then digested with DpnII and free blunt ends were ligated. Crosslinks were reversed, and the DNA purified from protein, which was then sheared to ~350 bp mean fragment size. Libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters, and sequencing was carried out on an Illumina HiSeq X platform. Chicago and Dovetail HiClibrary reads were then used as input data for genome assembly for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies. A previously reported draft genome of Avicennia marina (GenBank accesion: GCA_900003535.1) was used in the assembly pipeline, excluding scaffolds shorter than 1Kb since HiRise does not assemble them.

The mitochondrial genome was assembled using NOVOplasty2.7.2 and resequencing data based on Illumina paired-end 150 bp libraries from a conspecific individual (See below; Supplementary Information). The maturase (matR) mitochondrial gene available in NCBI (GenBank accession no. AY289666.1) was used for the input seed sequence.

Genome annotation
We performed the annotation of the A. marina genome using mRNA data from a set of tissues of conspecific individuals in combination with de novo gene prediction using BRAKER2 v2.1.5 (Hoff et al. 2016). Samples were collected on the coast of the Eastern Central Red Sea north of Jeddah in the Kingdom of Saudi Arabia (22.324 °N, 39.100 °E; Figure 1A). Total RNA was isolated from root, stem, leaf, flower, and seed using TRIzol reagent (Invitrogen, USA). RNA-seq libraries were prepared using TruSeq RNA sample prep kit (Illumina, Inc.), with inserts that range in size from approximately 100-400 bp. Library quality control and quantification were performed with a Bioanalyzer Chip DNA 1000 series II (Agilent), and sequenced in a HiSeq2000 platform (Illumina, Inc.). First, repetitive regions were modelled ab initio using RepeatModeler v2.0.1 (Flynn et al. 2019) in all scaffolds longer than 100 Kb with default options. The resulting repeat library was used to annotate and soft-mask repeats in the genome assembly with RepeatMasker 4.0.9 (Smit et al. 2015). Next, messenger RNA reads were mapped against the soft-masked genome assembly with HISAT2 (Kim et al. 2015). Gene prediction was conducted with BRAKER2 using both the RNA-seq data and the conserved orthologous genes from BUSCO Eudicots_odb10 as proteins from short evolutionary distance to provide hints and train GeneMark-ETP and Augustus (--etpmode; Hoff et al. 2019; Bruna et al. 2020; Lomsadze et al. 2005; Buchfink et al. 2015; Gotoh 2008; Iwata and Gotoh 2012; Li et al. 2009; Barnett et al. 2011; Lomsadze et al. 2014; Stanke et al. 2008; Stanke et al. 2006). The obtained gene annotation gff3 file was validated and used to generate the reported gene annotation statistics with GenomeTools (Gremme et al. 2013) and in-house Perl scripts. Finally, we conducted a similarity-based approach to assist the functional annotation of the predicted proteins. We integrated InterProScan v5.31 (Jones et al. 2014) and BLAST (Tatusova and Madden 1999) searches using the UniProt Swiss-Prot database and the annotated proteins from the Arabidopsis thaliana genome (UniProt Consortium 2019) to generate a final set of annotated functional genes.

Variant calling from resequencing data
Whole genome resequencing was carried out for the 60 individuals from 6 populations around the Arabian Peninsula at Novogene facilities. Illumina paired-end 150 bp libraries with insert size equal to 350 pb were prepared and sequenced in a Novaseq platform. A total of 2.4G reads were produced resulting in a mean coverage per site and sample of 85X before filtering. Read quality was evaluated using FASTQC after sorting reads by individual with AXE. Trimming and quality filtering treatment was conducted using Trim Galore, resulting in a set of reads ranging between 90 and 138 bp long. Reads were then mapped against the A. marina reference genome using the mem algorithm in the Burrows-Wheeler Aligner. Read groups were assigned and BAM files generated with Picard Tools version 1.126. We used the HaplotypeCaller + GenotypeGVCFs tools from the Genome Analysis Toolkit version 3.6-0 to produce a set of single nucleotide polymorphisms (SNPs) in the variant call format (vcf). Genotype quality and missing data filters for downstream analyses were implemented with vcftools. Samples with less than 25% of the sites genotyped were discarded. Then, a SNP matrix was constructed excluding those out of a range of coverage between 4 and 50 or with a genotyping phred quality score below 40. Positions for which one or more samples were not genotyped were removed, along with those presenting a minor allele count (MAC) below 3. Only the SNPs from the 32 major scaffolds were retained. A threshold for SNPs showing highly significant deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of 10^-4 was also implemented to filter out false variants arisen by the alignment of paralogous loci. Final dataset consisted on 56 samples and 538,185 SNPs.

A high-quality genome assembly and annotation of the gray mangrove, Avicennia marina

Data files

Abstract

Methods

Usage notes