Supplementary dataset for: Genome-wide signatures of adaptation to extreme environments in red algae

Cho, Chung Hyun 1 ; Yoon, Hwan Su 1

Published Oct 18, 2022; Updated Dec 28, 2022 on Dryad. https://doi.org/10.5061/dryad.cfxpnvx7b

Data files

Oct 18, 2022 version files 445.77 MB

Dryad_folder_221017.zip

445.76 MB
README.txt

10.18 KB

Dec 06, 2022 version files 447.51 MB

Dryad_folder_221206.zip

447.50 MB
README.txt

10.42 KB

Dec 28, 2022 version files 459.82 MB

Dryad_folder_221228.zip

459.81 MB
README.txt

11.60 KB

Abstract

The high temperature, acidity, and heavy-metal-rich environments associated with hot springs have a major impact on biological processes in resident cells. Understanding the origin and evolution of adaptations to these extreme environments has long intrigued scientists. One group of photosynthetic eukaryotes, the Cyanidiophyceae (Rhodophyta), has successfully thrived in hot springs and associated sites (e.g., endolithic) worldwide for >1 billion years. These poly-extremophilic red algae are models for studying a wide range of topics in cell evolution. Here, we analyze chromosome-level assemblies from three Cyanidiophyceae species, the Cyanidiales Cyanidium caldarium 063 E5 and Cyanidiococcus yangmingshanensis 8.1.23 F7 and the Galdieriales Galdieria sulphuraria 108.79 E11, to gain insights into environmental adaptation. We find that horizontal gene transfer (HGT) has played a major role in this process, as has other mechanisms such as subtelomeric gene duplication (STGD) of functional genes and the elimination of canonical eukaryotic traits, including microRNA processing. These extremophilic adaptation strategies are shared by the two major orders, Cyanidiales and Galdieriales, but most of the specialized genes evolved independently in each lineage. Our findings provide significant and novel insights into Cyanidiophyceae adaptation to hot-spring environments and demonstrate how the genomic consequence of extremophilic adaptation varies among the taxa in different microhabitats. The latter result underlines the power of local selection to shape eukaryotic genomes that face vastly different stresses, although the cells may live in close proximity.

Whole genome sequencing (WGS) and whole transcriptome sequencing (WTS). For genome and transcriptome sequencing, both short-read and long-read sequencing were conducted. For PacBio whole genome sequencing (WGS), we used SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences) with a 15 kbp size selection to construct Sequel I sequencing libraries of Cyanidium and Cyanidiococcus. For Galdieria PacBio WGS, SMRTbell® Express Template Prep Kit 1.0 (Pacific Biosciences) with a 9 kbp size selection was used to prepare the RS II sequencing library and SMRTbell Express TPK 2.0 (Pacific Biosciences) was used for HiFi library preparation. All experiments followed the manufacturer's standard protocol, without shearing step in Cyanidium and Galdieria samples. SQK-LSK109 ligation kit (Oxford Nanopore Technologies) was used to construct a library of Galdieria PromethION sequencing without shearing step and a 20 kbp size selection. For Illumina HiSeq2500 WGS of Cyanidium and Galdieria species, TruSeq® Nano DNA Prep Kit (Illumina) with an insert size 550 bp was used to prepare gDNA sequencing libraries. The same kit and protocol were used for Cyanidiococcus WGS and ran in the Illumina NovaSeq6000 platform. SMARTer PCR cDNA Synthesis Kit (Clontech) and SMRTbell® Express Template Prep Kit 1.0 (Pacific Biosciences) were used to prepare PacBio WTS (Iso-Seq) libraries. Clustering and deduplication of Iso-Seq reads were done by IsoSeq v3 implemented in Sequel SMRT® Link v8.0 and high-quality reads (99% accuracy) from clustered results were only used for subsequent analysis. For Illumina WTS (RNA-Seq), TruSeq® Stranded mRNA Prep Kit (Illumina) was used for library construction for all species and those libraries were sequenced with Illumina NovaSeq6000 platform. Adapter and quality trimming for Illumina sequencing reads were conducted using Trimmotatic v0.36 with parameter settings of ‘ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:100’.

Genome size estimation and genome assembly. We chose different approaches for genome assembly of individual species due to the differences in sequencing methods and assembly performances. Although the basic outline of the assembly process is consistent across species, we have used multiple platforms and methods to improve the quality of each species' assembly. The basic outline of assembly is as follows: i) build a draft assembly using long-read sequencing platforms (e.g., PacBio, Nanopore) applying multiple assemblers (e.g., HGAP, CANU, FALCON, MaSuRCA), ii) sort out organelle genomes (e.g., mitochondria, chloroplast) to get nuclear genome assembly only, iii) use additional scaffolding method (e.g., RaGOO) based on reference assembly or manually complement non-covering regions from other assemblers, iv) use haplo-merging tools (e.g., Purge-Dups, Purge Haplotigs) to remove duplicated regions that are not considered necessary in a haploid genome, v) correct assembled genome using Illumina reads (e.g., Bowtie2, Pilon) and assess chimeric region based on mapping coverage of reads.

For Cyanidium caldarium 063 E5, we have only used two assemblers, MaSuRCA v3.4.2 was used for the main genome, and Miniasm v0.3-r179 was used for complementing regional difference between two assembled contigs. RaGOO v1.1 with long read validation was used for contig scaffolding to finalize scaffolds, and purge-dup v1.2.5 was used to remove haplotigs. Finally, the final 20 chromosomes were recovered from the scaffolding process, and chromosome sequences were processed for error correction with pre-processed short-read data using Bowtie2 v2.3.4.1 (‘very-sensitive’ option) and Pilon v1.23. We repeated this correction step until no conflict sequence was found between corrected and query genomes.

The draft genome of Cyanidiococcus yangmingshanensis 8.1.23 F7 was assembled using HGAP4 implied in PacBio SMRT portal and we compared the result with FALCON-Unzip v1.8.1 assembly result. Organelle genomes were sorted out from assembled genomes using previously established plastid genomes and mitogenomes. We were able to recover 20 chromosomes from HGAP4 result without using any scaffolding process, and FALCON contigs were used to refine uncovered subtelomere regions. Genome correction, as done in Cyanidium, was used to fine-tune the genome sequence after recovering Cyanidiococcus chromosomes.

Besides hybrid assembly using different platform sequencing couldn’t give us a clear assembly result of the Galdieria genome, we used different combinations of data for assembly; i) FALCON assembler v0.3.0 using PacBio HiFi reads, ii) Nanopore sequencing-based CANU v2.2 assembly, iii) PacBio RS II-based HGAP3, and iv) MaSuRCA v3.2.4 (PacBio and Illumina hybrid assembly). The basic structure of the Galdieria sulphuraria 108.79 E11 genome was built using HiFi result, and other assemblers were used for genome scaffolding and obtaining unique gene regions that HiFi assembly didn’t cover. Because the Galdieria genome has higher heterozygosity than other Cyanidiales lineages, we used different correction tools (e.g., Pilon v1.2.4, NextPolish v1.2.3, Hapo-G v1.0) using Illumina and PacBio HiFi reads with multiple replications for genome polishing. In addition, due to small chromosome size and duplicated regions across chromosomes, it was hard to discriminate or pair each chromosome to make a haploid genome. As a result, we decided to include a few overlapping chromosomal contigs (e.g., haplotigs) in the Galdieria genome as a pan-genome concept.

PacBio reads were mapped to assembled genomes by minimap v2.17-r941 after all genomes were constructed, and WGSCoveragePlotter was used to visualize mapping coverage of each species. We have used Jellyfish v2.2.8 and KMC v2.3.0 to counting k-mers and estimated genome size using GenomeScope 2.0. When compared to the estimated genome size using k-mers, the assembled genome covered at least 90% of the predicted size.

Gene modeling and annotation. After reconstruction of genomes, we mapped Illumina RNA-Seq and PacBio Iso-Seq data by STAR(long) v2.7.5a to identify transcribed regions from genome data. Transcriptome-mapped data (e.g., Illumina RNA-Seq, PacBio Iso-Seq) was used for the training set of ab initio gene modeling, and BRAKER v2.1.6 and GeMoMa v1.7.1 were performed for the gene annotation. However, unlike Galdieria species, BRAKER-based gene annotation did not work well with Cyanidiales genomes due to Cyanidiales' unique gene features (e.g., intron-poor gene, short intergenic region). Considering these features, we used Augustus v3.3.1 for ab initio modeling based on BUSCO training sets and Exonerate v2.4.0 for homology-based gene prediction using reference proteins of Cyanidioschyzon and Cyanidiococcus. Combining all gene modeling results with RNA-Seq and Iso-Seq mapping information, we finalized and corrected gene modeling by manual inspection of integrated information (e.g., ab initio gene modeling, reference proteome homology-based gene modeling, transcript-mapped regions) in all three species. Additionally, some of the putatively mispredicted genes in the Galdieria genome (approximately 70 genes) were manually removed based on two criteria: i) exclusive intron patterns without support from RNA-seq and Iso-Seq data, ii) no homology with other proteins and lack of a function domain inside the protein. The completeness of gene modeling was verified by BUSCO v3.0.2 using general eukaryote database (‘eukaryota_odb9’). Despite the availability of a more recent BUSCO database ('eukaryota odb10, n:303'; 21.1% missing BUSCOs in Cyanidioschyzon 10D), we chose to use the previous version database ('eukaryota odb9, n:255'; 3.6% missing BUSCOs in Cyanidioschyzon 10D) because recent version contains many missing genes that were lost in the cyanidiophycean lineage tested by reference genome (Cyanidioschyzon 10D). We used multiple methods for functional annotations of genes in each species: i) BLAST-based search (e.g., MMSeqs2, DIAMOND) against NCBI nr protein database, ii) HMMER-based search against a customized HMM database of KEGG Orthologs using KofamKOALA (ver. 2021-03-01), iii) BLAST-based search using eggNOG v5.0, which is a specialized database for functional annotation. For functional RNA annotation, we applied Infernal v1.1.2 using Rfam v12.5 (March 2021, 3940 families) database. Transcription start site predictions were identified by TSSPlant with the support of an in-house python script.

Repeat sequences in genomes were identified using the de novo method in RepeatModeler v2.0.2a (http://www.repeatmasker.org/RepeatModeler) following the analysis pipeline used in a previous study. We used 13 and 14 l-mers optimized from ‘log4[genome size] + 1’ for the repeat analysis and classified them into repeat subclasses using RepBase (updated October 26th, 2018) and Dfam v3.3 (November 09th, 2020) database. The genetic distance between repeat copies found was extracted from the output of RepeatMasker v4.1.2-p1 and used to calculate Kimura distance values.

Genome analysis. Nucleotide sequence alignment-based genome comparisons were performed by JupiterPlot v1.0 (https://github.com/JustinChu/JupiterPlot) to see structural variation. However, nucleotide alignment-based genome comparison between cyanidiophycean species had insufficient resolution, so we conducted gene synteny-based comparison for higher levels of taxonomy. Cyanidiophyceae genomes were compared using synteny blocks identified by MCScanX with a minimum syntenic block length of five genes and a maximum gap between genes in a syntenic block of 25 genes. Tree view mode of SynVisio was used to visualize the results of the synteny block comparison. To identify subtelomeric regions from genomes, LASTZ alignment v7.0.2 was used to see if there were any conserved regions between chromosomes.

Following the analysis pipeline used in a previous study, repeat sequences in genomes were identified using the de novo method using the RepeatModeler v2.0.2a (http://www.repeatmasker.org/RepeatModeler). We used 13 and 14 l-mers optimized by round from ‘log4[genome size] + 1’ for repeat analysis and classified into repeat subclasses using RepBase (updated October 26th, 2018) and Dfam v3.3 (November 09th, 2020) database. Genetic distance between repeat copies found in genomic sequences was parsed from the output of RepeatMasker v4.1.2-p1 and used to measure the Kimura’s distance.

The grouping of orthologous genes was performed by Orthofinder v2.5.2 with the default option, and protein datasets were collected from 35 representative taxa of Archaeplastida. Gene gain and loss events of cyanidiophycean algae were tested by the Dollo parsimony method (DolloP) using Archaeplastida-based orthogroups. We’ve used this orthogroup information for further analysis of gene families, however, two major issues have arisen: i) some misannotated genes found in individual strains combine two independent gene families into one orthogroup that has no functional domain in common but is clustered together by a misannotated fused gene, and ii) some orthogroups were separated due to protein properties (e.g., highly diverged protein, small size protein) due to a unified parameter adjusted for all different gene families. We were not able to discard some of the problematic genes from whole orthogroups because we don’t have strong evidence to reject published gene modeling data. Therefore, we manually confirmed controversial genes that appeared to be misannotated compared to sister species or strains (i.e., parsimonious approach) for further analysis.

Phylogenetic analysis of genes. To determine the evolutionary history of target genes, we obtained homologous protein sequences from the NCBI non-redundant protein sequence database by applying protein homologous searches using MMSeqs2 v13.45111. Sequences collected for phylogenetic analysis were aligned using MAFFT v7.310, and some alignments with a lot of gaps were trimmed out using trimAl v1.4 '-automated1' option. IQ-TREE v2.1.2 was used for Maximum Likelihood (ML) inference of the phylogenetic tree. To select evolutionary models, implemented model selection was used, and 1,000 replications of the ultrafast bootstrap approximation approach (UFBoot2) were used for phylogenetic analysis. After phylogenetic trees were constructed, we removed a few taxa that seem to be redundant due to dataset disequilibrium of taxon sampling (e.g., extensively sequenced in a particular lineage). Following the removal of redundant taxa, we reanalyzed the datasets, beginning with the alignment and performing the phylogenetic analysis as previously stated. The final trees were visualized using FigTree v1.4.4 with a midpoint root or an unrooted tree if outgroups were not considered from the start.

Identification of subtelomere and gene duplication ratio. To identify subtelomeric regions from genomes, LASTZ alignment v7.0.2 was used to determine if there were conserved regions between chromosomes 108. Subtelomere regions near telomeric repeats were manually confirmed using LASTZ alignments across chromosomes, and subtelomeric genes were identified within subtelomeric regions. We attempted to remove paralogs from gene duplication detection and focus on recent gene duplications in order to calculate the proportion of subtelomeric gene duplication when compared to the overall number of gene duplications. DIAMOND v2.0.5.143 with variable parameters was applied to conduct protein homology searches (blastp) between each protein sequence in the entire proteomes. Query and subject coverage (‘-q’,’-s’) were set to 70 to 90% with 5% intervals, and protein identity (‘-i’) was set to 70-90% with 5% intervals as well. As a result, this analysis used a total of 25 parameter combinations, which were visualized in a plot. Fisher's exact test ('fisher.test'), implemented in R was used independently to test the significance of subtelomeric regions and gene duplication in each species.

Analysis of histone modification ChIP-Seq data. We used previously sequenced ChIP-Seq data [Input DNA, histone H3 (H3), and tri-methylation of lysine 27 on histone H3 (H3K27me3)] from Cyanidioschyzon merolae 10D to confirm H3K27me3 histone modification pattern in Cyanidiophyceae. All ChIP-Seq data were mapped against the Cyanidioschyzon genome using Bowtie2 (v2.3.4.1), and the peaks were identified with Model Based Analysis of ChIP-seq data (MACS3 v3.0.0a7). Input DNA data was used as a control for both H3K27me3 and H3. Enrichment of H3K27me3 peaks refer to the MACS3-calculated log fold changes over H3 and we used calculated fold-enrichment information for further analysis. We used IGV v2.11.0 for visualizing the output findings of "broadPeak" and "gappedPeak," which were signal enrichment based on pooled and normalized data.

Non-synonymous substitutions per non-synonymous sites (Ka) and synonymous substitutions per synonymous sites analysis. To assess evolutionary selection of subtelomeric duplicated genes, each of the subtelomeric duplicated genes was aligned by MAFFT v7.471. Ka/Ks analysis was performed using ParaAT v2.0 and KaKs_Calculator v2.0.

Characterization of polycistronic transcripts. We used deduplicated high-quality transcripts from PacBio Iso-Seq circular consensus sequencing (CCS) reads to identify polycistronic transcripts, and all transcripts were mapped to the genome using STARlong v2.7.5a. Using gene modeling information and mapped information, polycistronic transcripts were identified by an in-house python script based on the criteria of completely covering at least two gene regions in the same direction as the transcript. After identifying of polycistronic loci, internal ribosome entry sites (IRESs) were identified from all putative polycistronic transcripts using IRESfinder.

cDNA synthesis and PCR. To verify polycistronic gene expression, we synthesized cDNA using Thermo Scientific First Strand cDNA Synthesis Kit cat. #K1612 (Thermo Fisher Scientific, Waltham, USA). Before synthesizing cDNA from extracted RNAs, we treated DNase I cat. #EN0521 (Thermo Fisher Scientific, Waltham, USA) to prevent DNA contamination. Oligo(dT)18 primers were used for cDNA synthesis. Customized polycistronic primers were designed for polymerase chain reaction (PCR). AccuPower® PCR PreMix cat. #K-2012 (Bioneer, Daejeon, Korea) was used for PCR, and PCR products were purified with LaboPass™ PCR Purification Kit cat. #CMR0112 (Cosmo Genetech, Seoul, Korea) for Sanger sequencing (Macrogen, Seoul, Korea).

Protein homeostasis survey. Hidden Markov models (HMMs) of different chaperone types (Hsp20, 40, 60, 70, 90, and 100) were used for hmmsearch and were performed for profiling chaperones in our data. TANGO v2.3.1 was used for estimating proteins aggregation propensity based on protein sequences and hydrophobicity of proteomes was computed with the Kyte–Doolittle hydropathy scale using python and R scripts (https://github.com/pechmannlab/chapevo). Statistical tests were done with R packages providing t-test and others.

Circular dichroism (CD) spectroscopy. Oligonucleotides were designed based on telomeric repeats of cyanidiophycean species and compared to previously confirmed G-quadruplex forming telomeric tandem repeats. DNA samples for CD spectroscopy were prepared in 10 mM Tris–HCl (pH 7.4), 1 mM EDTA, 150 mM KCl, and 40% (w/v) PEG 200 cat. P3015 (Sigma-Aldrich, St Louis, USA) to induce the macromolecular crowding effect and stabilize G-quadruplex structures. Before the experiment, the DNA mixtures were heated at 95 °C for 5 minutes and cooled to room temperature (at least 20 minutes). Circular dichroism (CD) measurements of oligonucleotides were performed on a Jasco J-815 spectropolarimeter (Jasco, Tokyo, Japan) at 25°C using Hellma® Macro-cuvette 110-QS (1 mm path length). CD spectra of various DNA samples (5 μM DNA) were recorded from 350 to 200 nm using a 1 nm scale and a scanning speed of 100 nm/min. CD spectra measurements were repeated three times for each sample, and mean values were used. The ggplot2 R packages' 'geom_smooth' function was used to plot CD spectra (mdeg) by wavelength (nm).

Supplementary dataset for: Genome-wide signatures of adaptation to extreme environments in red algae

Data files

Abstract

Methods

Works referencing this dataset