A high-quality genome assembly and annotation of the gray mangrove, Avicennia marina
Data files
May 27, 2020 version files 524.61 MB
-
Amar_GenomeAnnotation.tar.gz
Jul 09, 2020 version files 514.16 MB
-
Amarina_GenomeDatasets.tar.gz
Oct 19, 2020 version files 442.19 MB
-
Amarina_GenomeDatasets.rar
Feb 17, 2022 version files 492.15 MB
-
Amar_GenomeAnnotation.tar.gz
Abstract
The gray mangrove [Avicennia marina (Forsk.) Vierh.] is the most widely distributed mangrove species, ranging throughout the Indo-West Pacific. It presents remarkable levels of geographic variation both in phenotypic traits and habitat, often occupying extreme environments at the edges of its distribution. However, subspecific evolutionary relationships and adaptive mechanisms remain understudied, especially across populations of the West Indian Ocean. High-quality genomic resources accounting for such variability are also sparse. Here we report the first chromosome-level assembly of the genome of A. marina. We used a previously release draft assembly and proximity ligation libraries Chicago and Dovetail HiC for scaffolding, producing a 456,526,188 bp long genome. The largest 32 scaffolds (22.4 Mb to 10.5 Mb) accounted for 98 % of the genome assembly, with the remaining 2% distributed among much shorter 3,759 scaffolds (62.4 Kb to 1 Kb). We annotated 45,032 protein-coding genes using tissue-specific RNA-seq data in combination with de novo gene prediction, from which 34,442 were associated to GO terms. Genome assembly and annotated set of genes yield a 96.7% and 95.1% completeness score, respectively, when compared with the eudicots BUSCO dataset. Furthermore, an FST survey based on resequencing data successfully identified a set of candidate genes potentially involved in local adaptation, and revealed patterns of adaptive variability correlating with a temperature gradient in Arabian mangrove populations. Our A. marina genomic assembly provides a highly valuable resource for genome evolution analysis, as well as for identifying functional genes involved in adaptive processes and speciation.
Methods
Genome sequencing and assembly
The sequenced sample was leaf tissue obtained from an individual located at Ras Ghurab Island in the Arabian Gulf (Abu Dhabi, United Arab Emirates; 24.601°N, 4.566 °E), corresponding to the A. m. marina variety. A high-quality genome was produced using proximity ligation libraries and the software pipeline HiRise at Dovetail Genomics, LLC. Briefly, for Chicago and the Dovetail HiC library preparation, chromatin was fixed with formaldehyde. Fixed chromatin was then digested with DpnII and free blunt ends were ligated. Crosslinks were reversed, and the DNA purified from protein, which was then sheared to ~350 bp mean fragment size. Libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters, and sequencing was carried out on an Illumina HiSeq X platform. Chicago and Dovetail HiC library reads were then used as input data for genome assembly for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies. A previously reported draft genome of Avicennia marina (GenBank accesion: GCA_900003535.1) was used in the assembly pipeline, excluding scaffolds shorter than 1Kb since HiRise does not assemble them.
The mitochondrial genome was assembled using NOVOplasty2.7.2 and resequencing data based on Illumina paired-end 150 bp libraries from a conspecific individual (See below; Supplementary Information). The maturase (matR) mitochondrial gene available in NCBI (GenBank accession no. AY289666.1) was used for the input seed sequence.
Genome annotation
We performed the annotation of the A. marina genome using mRNA data from a set of tissues of conspecific individuals in combination with de novo gene prediction using BRAKER2 v2.1.5 (Hoff et al. 2016). Samples were collected on the coast of the Eastern Central Red Sea north of Jeddah in the Kingdom of Saudi Arabia (22.324 °N, 39.100 °E; Figure 1A). Total RNA was isolated from root, stem, leaf, flower, and seed using TRIzol reagent (Invitrogen, USA). RNA-seq libraries were prepared using TruSeq RNA sample prep kit (Illumina, Inc.), with inserts that range in size from approximately 100-400 bp. Library quality control and quantification were performed with a Bioanalyzer Chip DNA 1000 series II (Agilent), and sequenced in a HiSeq2000 platform (Illumina, Inc.). First, repetitive regions were modelled ab initio using RepeatModeler v2.0.1 (Flynn et al. 2019) in all scaffolds longer than 100 Kb with default options. The resulting repeat library was used to annotate and soft-mask repeats in the genome assembly with RepeatMasker 4.0.9 (Smit et al. 2015). Next, messenger RNA reads were mapped against the soft-masked genome assembly with HISAT2 (Kim et al. 2015). Gene prediction was conducted with BRAKER2 using both the RNA-seq data and the conserved orthologous genes from BUSCO Eudicots_odb10 as proteins from short evolutionary distance to provide hints and train GeneMark-ETP and Augustus (--etpmode; Hoff et al. 2019; Bruna et al. 2020; Lomsadze et al. 2005; Buchfink et al. 2015; Gotoh 2008; Iwata and Gotoh 2012; Li et al. 2009; Barnett et al. 2011; Lomsadze et al. 2014; Stanke et al. 2008; Stanke et al. 2006). The obtained gene annotation gff3 file was validated and used to generate the reported gene annotation statistics with GenomeTools (Gremme et al. 2013) and in-house Perl scripts. Finally, we conducted a similarity-based approach to assist the functional annotation of the predicted proteins. We integrated InterProScan v5.31 (Jones et al. 2014) and BLAST (Tatusova and Madden 1999) searches using the UniProt Swiss-Prot database and the annotated proteins from the Arabidopsis thaliana genome (UniProt Consortium 2019) to generate a final set of annotated functional genes.
Variant calling from resequencing data
Whole genome resequencing was carried out for the 60 individuals from 6 populations around the Arabian Peninsula at Novogene facilities. Illumina paired-end 150 bp libraries with insert size equal to 350 pb were prepared and sequenced in a Novaseq platform. A total of 2.4G reads were produced resulting in a mean coverage per site and sample of 85X before filtering. Read quality was evaluated using FASTQC after sorting reads by individual with AXE. Trimming and quality filtering treatment was conducted using Trim Galore, resulting in a set of reads ranging between 90 and 138 bp long. Reads were then mapped against the A. marina reference genome using the mem algorithm in the Burrows-Wheeler Aligner. Read groups were assigned and BAM files generated with Picard Tools version 1.126. We used the HaplotypeCaller + GenotypeGVCFs tools from the Genome Analysis Toolkit version 3.6-0 to produce a set of single nucleotide polymorphisms (SNPs) in the variant call format (vcf). Genotype quality and missing data filters for downstream analyses were implemented with vcftools. Samples with less than 25% of the sites genotyped were discarded. Then, a SNP matrix was constructed excluding those out of a range of coverage between 4 and 50 or with a genotyping phred quality score below 40. Positions for which one or more samples were not genotyped were removed, along with those presenting a minor allele count (MAC) below 3. Only the SNPs from the 32 major scaffolds were retained. A threshold for SNPs showing highly significant deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of 10-4 was also implemented to filter out false variants arisen by the alignment of paralogous loci. Final dataset consisted on 56 samples and 538,185 SNPs.
Usage notes
Warning:
The genome assembly here provided has also been deposited at DDBJ/ENA/GenBank along with raw sequence data from Chicago and HiC libraries under the accession JABGBM000000000, Bioproject (SRA) accession: PRJNA629068; Biosample accession: SAMN14766548. For the use of this genome as a reference, and especially for the identification of functional genes based on the annotation, we recomend the version here submitted, given that the version avalable in GenBank has gone through further trimming due to requirements of the submission process, which entails a disadjustment with respect the gene coordinates provided in the gff3 file and supplementary resources.
For detailed information about the datsets here provided, check the our publication https://academic.oup.com/g3journal/article/11/1/jkaa025/6026961