Kinetochore and ionomic adaptation to whole genome duplication
Data files
Oct 17, 2023 version files 291.28 MB
- C_excelsa_V5_braker2_wRseq.aa.LTPG.fasta.gz
- C_excelsa_V5_braker2_wRseq.gff3
- C_excelsa_V5.fa.gz
- C_excelsa_V5.fa.masked.fa.gz
- Cochlearia-4dg.list
- README.md
Abstract
Whole genome duplication (WGD) brings challenges to key processes like meiosis but nevertheless is associated with diversification in all kingdoms. How is WGD tolerated, and what processes commonly evolve to stabilize the new polyploid lineage? Here we study this in Cochlearia spp., which have experienced multiple rounds of WGD in the last 300,000 years. We first generate a chromosome-scale genome and sequence 113 individuals from 33 diploid, tetraploid, hexaploid, and outgroup populations. We detect the clearest post-WGD selection signatures in functionally interacting kinetochore components and ion transporters. We structurally model these derived selected alleles, associating them with known WGD-relevant functional variation, and compare these results to independent recent post-WGD selection in Arabidopsis arenosa and Cardamine amara. Some of the same biological processes evolve in all three WGDs, but specific genes recruited are flexible. This points to a polygenic basis for modifying systems that control the kinetochore, meiotic crossover number, DNA repair, ion homeostasis, and cell cycle. Given that DNA management (especially repair) is the most salient category with the strongest selection signal, we speculate that the generation rate of structural genomic variants may be altered by WGD in young polyploids, contributing to their occasionally spectacular adaptability observed across kingdoms.
README: Kinetochore and ionomic adaptation to whole genome duplication
https://doi.org/10.5061/dryad.ncjsxkt1s
Biological context and methodology for the below are described in https://doi.org/10.1101/2023.09.27.559727
Description of the data and file structure
C_excelsa_V5.fa.gz
Chromosome-level genome assembly of Cochlearia excelsa, used as the reference to which reads were aligned in Bray et al. 2023 (https://doi.org/10.1101/2023.09.27.559727)
C_excelsa_V5.fa.masked.fa.gz
Repeat-masked version of the reference genome above
C_excelsa_V5_braker2_wRseq.gff3
Cochlearia excelsa reference genome annotation produced with BRAKER2, as described in the Methods below.
Cochlearia-4dg.list
List of fourfold degenerate (putatively neutral) sites, produced with degenotate (https://github.com/harvardinformatics/degenotate) and used in the Bray et al. 2023 and Hämälä et al. 2023 publications
C_excelsa_V5_braker2_wRseq.aa.LTPG.fasta.gz
Protein FASTA derived from the annotation above
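As background for Cochlearia-4dg.list: a fourfold degenerate site is a third codon position at which any of the four bases encodes the same amino acid. The sketch below illustrates that definition using the standard genetic code. It is not the degenotate implementation, which works from a genome and its annotation; the function names are hypothetical and the input is assumed to be an in-frame coding sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order:
# TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon: str) -> bool:
    """True if all four bases at the third position give the same amino acid."""
    codon = codon.upper()
    if codon not in CODON_TABLE:  # ambiguous bases or wrong length
        return False
    prefix = codon[:2]
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1


def fourfold_sites(cds: str) -> list[int]:
    """Return 0-based positions (within the CDS) of fourfold degenerate sites.

    Assumes `cds` starts in frame; a trailing partial codon is ignored.
    """
    sites = []
    for start in range(0, len(cds) - len(cds) % 3, 3):
        if is_fourfold_degenerate(cds[start:start + 3]):
            sites.append(start + 2)  # third codon position
    return sites
```

For example, `fourfold_sites("ATGGCTTTTGGATAA")` reports the third positions of the GCT (Ala) and GGA (Gly) codons only.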
Sharing/Access information
All sequence data for this study have been deposited in the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/view/PRJEB66308), accession number PRJEB66308.
Methods
Reference Genome Assembly and Alignment
We generated a long-read-based de novo genome assembly using Oxford Nanopore and Hi-C approaches, as described below.
High Molecular Weight DNA isolation and Oxford Nanopore sequencing. A total of 0.4 g Cochlearia excelsa leaf material from one individual plant was ground using liquid nitrogen before the addition of 10 ml of CTAB DNA extraction buffer (100 mM Tris-HCl, 2% CTAB, 1.4 M NaCl, 20 mM EDTA, and 0.004 mg/ml Proteinase K). The mixture was incubated at 55 °C for 1 hour, then cooled on ice before the addition of 5 ml chloroform. This was then centrifuged at 3000 rpm for 30 minutes; the upper phase was taken, added to 1X volume of phenol:chloroform:isoamyl alcohol, and spun for 30 minutes at 3000 rpm. Again, the upper phase was taken and mixed with a 10% volume of 3 M NaOAc and 2.5X volume of 100% ethanol at 4 °C. This was incubated on ice for 30 minutes before being centrifuged for 30 minutes at 3000 rpm and 4 °C. The pellet was washed three times in 4 ml 70% ethanol at 4 °C before being centrifuged again for 10 minutes at 3000 rpm and 4 °C. The pellet was then air-dried and resuspended in 300 μl nuclease-free water containing 0.0036 mg/ml RNase A. The quantity and quality of high molecular weight (HMW) DNA were checked on a Qubit Fluorometer 2.0 (Invitrogen) using the Qubit dsDNA HS Assay kit. Fragment sizes were assessed using a Q-card (OpGen Argus) and the Genomic DNA Tapestation assay (Agilent). Removal of short DNA fragments and final purification to HMW DNA was performed with the Circulomics Short Read Eliminator XS kit.
Long-read libraries were prepared using the Genomic DNA by Ligation kit (SQK-LSK109; Oxford Nanopore Technologies) following the manufacturer’s procedure. Libraries were then loaded onto an R9.4.1 PromethION Flow Cell (Oxford Nanopore Technologies) and run on a PromethION Beta sequencer. Because blocked flow cell pores accumulated rapidly, or because of apparent read length anomalies, on some Cochlearia runs, the flow cells used were treated with a nuclease flush to digest blocking DNA fragments before loading with fresh library, according to the Oxford Nanopore Technologies Nuclease Flush protocol, version NFL_9076_v109_revD_08Oct2018.
Genome size estimation and computational ploidy inference. We used KMC to create a k-mer frequency spectrum (k-mer length = 21) of trimmed Illumina reads. We then used GenomeScope 2.0 (parameters: -k 21 -m 61) and Smudgeplot to estimate genome size and heterozygosity from k-mer spectra.
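To make the idea of a k-mer frequency spectrum concrete, here is a minimal pure-Python sketch. It is an illustration only, not a substitute for KMC, and would be far too slow for real Illumina data. It assumes k-mers are counted in canonical form (the lexicographically smaller of a k-mer and its reverse complement) and that k-mers containing non-ACGT characters are skipped; the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k: int = 21) -> Counter:
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _VALID:  # skip k-mers with N or other ambiguity codes
                counts[canonical(kmer)] += 1
    # Histogram of counts: multiplicity -> number of distinct k-mers
    return Counter(counts.values())
```

The resulting histogram (multiplicity versus number of distinct k-mers) is the kind of input from which tools such as GenomeScope fit genome size and heterozygosity.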
Data processing and assembly. Fast5 sequences produced by PromethION sequencing were basecalled using the Guppy 6 high-accuracy basecalling model (dna_r9.4.1_450bps_hac.cfg), and the resulting fastq files were quality filtered by the basecaller. A total of 17.2 GB of basecalled data was generated for the primary assembly, resulting in 60x expected coverage. Primary assembly was performed in Flye and NECAT. The contigs were then polished in a single round with Medaka and Pilon to improve single-base accuracy.
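The relationship between sequencing yield and expected coverage is simple arithmetic, sketched below. This assumes that "17.2 GB" refers to gigabases of sequence and that coverage is total bases divided by genome size; the function names are hypothetical, and the implied genome size is a back-of-envelope figure, not a value reported in the text.

```python
def expected_coverage(total_bases: float, genome_size: float) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases: float, coverage: float) -> float:
    """Genome size implied by a given yield and expected coverage."""
    return total_bases / coverage


# Figures quoted in the text: 17.2 Gb of basecalled data at 60x expected coverage.
total_bases = 17_200_000_000
size = implied_genome_size(total_bases, 60)
print(f"Implied genome size: {size / 1e6:.0f} Mb")
```

Under these assumptions, the quoted yield and coverage correspond to a genome of roughly 287 Mb.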
Pseudomolecule construction by Hi-C, assembly cleanup, and polishing. To scaffold the assembled contigs into pseudomolecules, we performed chromosome conformation capture using Hi-C. Leaves from a single plant were snap-frozen in liquid nitrogen and ground to a fine powder using a mortar and pestle. The sample was then homogenised, cross-linked and shipped to Phase Genomics (Seattle, USA), who prepared and sequenced an in vivo Hi-C library. After filtering low-quality reads with Trimmomatic, we aligned the Hi-C reads against the contig-level assembly using bwa-mem (settings -5 -S -P) and removed PCR duplicates using Picard Tools (https://broadinstitute.github.io/picard/). We used 3D-DNA to conduct the initial scaffolding, followed by manual curation in Juicebox. After manually assigning chromosome boundaries, we searched for centromeric and telomeric repeats to orient the chromosome arms and to assess the completeness of the assembled pseudomolecules. To identify the centromeric repeat motif in C. excelsa, we used the RepeatExplorer pipeline to search for repetitive elements in short-read sequence data originating from the reference individual. RepeatExplorer discovered a highly abundant 102-nucleotide repeat element (comprising 21% of the short-read sequence), which we confirmed as the centromeric repeat motif by fluorescence in situ hybridisation. Using BLAST, we localised the centromeric and telomeric (TTTAGGG) repeats and used them to orient the chromosome arms. We performed a final assembly cleanup in Blobtools (Fig. S1). Gene space completeness was assessed using BUSCO version 3.0.2.
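The text localises repeats with BLAST. As a simpler stand-in for illustration only, the sketch below scans a sequence for tandem arrays of the telomeric motif TTTAGGG on either strand using a regular expression. The function names and the `min_copies` threshold are hypothetical, and only exact, uninterrupted copies are matched, which is stricter than a BLAST search.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_repeat_arrays(seq: str, motif: str = "TTTAGGG", min_copies: int = 3):
    """Return (start, end, strand) for each tandem array of `motif`.

    Coordinates are 0-based, end-exclusive. Strand is "+" when the motif
    itself is found and "-" when its reverse complement is found.
    """
    fwd = re.escape(motif.upper())
    rev = re.escape(reverse_complement(motif.upper()))
    pattern = re.compile(
        f"(?P<fwd>(?:{fwd}){{{min_copies},}})|(?P<rev>(?:{rev}){{{min_copies},}})"
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        strand = "+" if match.lastgroup == "fwd" else "-"
        hits.append((match.start(), match.end(), strand))
    return hits
```

Hits near the two ends of a pseudomolecule, and the strand on which they occur, are the kind of evidence that can be used to check arm orientation and end completeness.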
Assembly annotation and RNA-seq
Prior to gene annotation, we identified and masked transposable element (TE) sequences from the genome assembly. To do so, we used the EDTA pipeline, which combines multiple methods to comprehensively identify both retrotransposons and DNA transposons. After running EDTA on our chromosome-level genome assembly, we performed BLAST queries against a curated protein database from Swiss-Prot to remove putative gene sequences from the TE library and masked the remaining sequences from the assembly using RepeatMasker (https://www.repeatmasker.org).
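One quick sanity check on a masked assembly is the fraction of each sequence that is masked. The sketch below is a minimal example under stated assumptions: it treats both lowercase bases (soft masking) and N (hard masking) as masked, because the text does not say which convention the masked file uses, and the helper names are hypothetical.

```python
import gzip
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier before the first whitespace
            name = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(seq: str) -> float:
    """Fraction of bases that are lowercase (soft-masked) or N (hard-masked)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base == "N")
    return masked / len(seq)


def report(path: str) -> None:
    """Print the masked fraction per sequence of a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        for name, seq in read_fasta(handle):
            print(f"{name}\t{len(seq)}\t{masked_fraction(seq):.3f}")
```

For example, `report("C_excelsa_V5.fa.masked.fa.gz")` would print one line per sequence with its length and masked fraction, assuming the file is in the working directory.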
We then used the BRAKER pipeline to conduct gene annotation on the TE-masked genome assembly. Evidence types included RNA-seq data from the identical C. excelsa line and protein data from related species. RNA-seq was generated from bud, stem and leaf tissue. Total RNA was extracted from each tissue using the Qiagen RNeasy Extraction Kit. Stranded polyA RNA libraries were constructed using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina and then evaluated by qPCR, TapeStation and Qubit at the DeepSeq facility (Nottingham, UK) before being sequenced as 150 bp paired-end reads (PE150) at Novogene, Inc. (Cambridge, UK). We mapped the RNA-seq reads of each tissue to our reference genome using STAR (--twopassMode Basic, otherwise default parameters) before running BRAKER2. Running BRAKER2 without UTR prediction generated more gene models and much better BUSCO metrics than running it with UTR prediction (97.8% [raw, pre-Blobtools trimmed] complete BUSCOs without UTR prediction vs 91.7% with UTR prediction), so for the final annotation we used the more complete set, produced by BRAKER2 without UTR prediction.
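Comparing the number of gene models between annotation runs comes down to tallying feature types in the GFF3 output. The sketch below is a minimal example, assuming a standard nine-column, tab-separated GFF3 file in which the feature type is the third column. The function name is hypothetical, and the feature type names actually present in the deposited file (for example "gene" or "mRNA") have not been checked here.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count GFF3 records by feature type (column 3).

    Comment and directive lines (starting with '#'), blank lines, and any
    line without exactly nine tab-separated columns are skipped.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts
```

A typical use would be `with open("C_excelsa_V5_braker2_wRseq.gff3") as fh: print(count_feature_types(fh))`, again assuming the file is in the working directory.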