Genomics of cold adaptations in the Antarctic notothenioid fish radiation
Data files
Dec 09, 2022 version files 117.21 MB
- alignments_full.zip (1.64 MB)
- astral.zip (4.24 KB)
- beast_full.zip (39.17 MB)
- beast_permissive.zip (37.96 MB)
- beast_strict.zip (27.92 MB)
- iqtree.zip (10.51 MB)
- README.txt (2.50 KB)
- species.tab (6.32 KB)
Abstract
Numerous novel adaptations characterise the radiation of notothenioids, the dominant fish group in the freezing seas of the Southern Ocean. To improve understanding of the evolution of this iconic fish group, we generated and analysed new genome assemblies for 24 species covering all major subgroups of the radiation, including five long-read assemblies. We present a new estimate for the onset of the radiation at 10.7 million years ago, based on a time-calibrated phylogeny derived from genome-wide sequence data. We identify a two-fold variation in genome size, driven by expansion of multiple transposable element families, and use the long-read data to reconstruct two evolutionarily important, highly repetitive gene family loci. First, we present the most complete reconstruction to date of the antifreeze glycoprotein gene family, whose emergence enabled survival in sub-zero temperatures, showing the expansion of the antifreeze gene locus from the ancestral to the derived state. Second, we trace the loss of haemoglobin genes in icefishes, the only vertebrates lacking functional haemoglobins, through complete reconstruction of the two haemoglobin gene clusters across notothenioid families. Both the haemoglobin and antifreeze genomic loci are characterised by multiple transposon expansions that may have driven the evolutionary history of these genes.
Methods
Phylogenetic analysis was performed using single-copy ortholog genes identified with BUSCO for the 24 newly sequenced notothenioid genomes and 17 previously published genomes of seven notothenioids and ten further species of percomorph fishes. BUSCO (v2) was run with the lineage “actinopterygii_odb9”, and the sequences of single-copy orthologs identified in each assembly were extracted for use in further analysis.
We used MAFFT v.7.453 to align 266 selected BUSCO genes that were single copy in our annotated gene sets. The 266 alignments were inspected by eye, and apparently misaligned sequence regions were set to missing data. A total of 1,141,524 amino acids were set to missing out of 6,410,688, including nine alignments that were excluded completely, leaving 257 alignments for further analysis. We then aligned nucleotide sequences of the same BUSCO genes according to the amino-acid alignments, ensuring that regions corresponding to the removed sequences were again set to missing data in the nucleotide sequence alignments. Sites with high entropy (entropy-like score > 0.5) or a high proportion of missing data (gap rate > 0.2) were removed with BMGE v.1.1, and alignments with more than three completely missing sequences, a length below 500 bp, or a standard deviation of among-sequence GC-content variation greater than 0.03 were excluded. These filters were passed by 228 alignments.
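The three alignment-level exclusion criteria described above can be illustrated with a minimal Python sketch. This is not the script used for the dataset: the function names, the set of characters treated as missing, and the use of the population standard deviation for GC content are all assumptions made for illustration.

```python
from statistics import pstdev

# Characters treated as missing data (an assumption for this sketch).
MISSING = set("-?Nn")

def gc_content(seq):
    """GC fraction among unambiguous A/C/G/T sites; None if there are none."""
    s = seq.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (s.count("G") + s.count("C")) / acgt

def passes_filters(alignment, max_missing_seqs=3, min_length=500, max_gc_sd=0.03):
    """Apply the three alignment-level thresholds from the text.

    alignment: dict mapping species name -> aligned nucleotide sequence
    (all sequences are assumed to have the same length).
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    # Count sequences that consist only of missing-data characters.
    n_missing = sum(all(c in MISSING for c in s) for s in seqs)
    # GC content is only defined for sequences with at least one A/C/G/T.
    gc_values = [g for g in map(gc_content, seqs) if g is not None]
    gc_sd = pstdev(gc_values) if len(gc_values) > 1 else 0.0
    return (n_missing <= max_missing_seqs
            and length >= min_length
            and gc_sd <= max_gc_sd)
```

Applied to each nucleotide alignment in turn, a filter of this kind would keep only those alignments for which `passes_filters` returns `True`.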
Each of these alignments was subjected to Bayesian phylogenetic analysis with BEAST 2 v.2.6.0, with an uncorrelated lognormal relaxed clock model and a Markov-chain Monte Carlo (MCMC) chain length of 25 million iterations. “Strict” and “permissive” sets of alignments were compiled based on estimates of the mutation rate and its among-species variation and contained 140 and 200 of the alignments, respectively. For the strict set of 140 alignments, the permissive set of 200 alignments, and the “full” set of 257 alignments, we performed maximum-likelihood phylogenetic analyses with IQ-TREE v.1.7 after alignment concatenation, maintaining separate partitions with unlinked instances of the GTR+Gamma substitution model for each of the original alignments. Node support was assessed with 1,000 ultrafast bootstrap replicates. Each of the three analyses was complemented with an estimation of gene- and site-specific concordance factors, and the three resulting sets of gene trees were used for separate species-tree analyses with ASTRAL v.5.7.3. Finally, we estimated the phylogeny and the divergence times of notothenioid species with BEAST 2 from a concatenated alignment combining all alignments of the strict set. The original data blocks were grouped in 12 partitions selected with the rcluster algorithm of PartitionFinder v.2.1.1, assuming linked branch lengths, equal weights for all model parameters, a minimum partition size of 5,000 bp, and the GTR+Gamma substitution model. The same substitution model was also assumed in the BEAST 2 analysis, together with the birth-death model of diversification and the uncorrelated lognormal relaxed clock model.
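As an illustration of concatenation that keeps one partition per original alignment, the following Python sketch joins per-gene alignments and records the coordinates of each block in RAxML-style partition lines, a format that IQ-TREE can read. The function name and the input data structure are hypothetical, and the substitution model itself would be specified separately; this is not the procedure actually used to build the files in this dataset.

```python
def concatenate(alignments):
    """Concatenate per-gene alignments and record partition boundaries.

    alignments: list of (gene_name, {species: aligned_sequence}) pairs.
    Species absent from a gene are padded with gaps for that block.
    Returns (dict of concatenated sequences, list of partition lines).
    """
    species = sorted({sp for _, aln in alignments for sp in aln})
    blocks = {sp: [] for sp in species}
    partitions = []
    start = 1
    for name, aln in alignments:
        length = len(next(iter(aln.values())))
        for sp in species:
            blocks[sp].append(aln.get(sp, "-" * length))
        end = start + length - 1
        # One partition per original alignment, with 1-based inclusive coordinates.
        partitions.append(f"DNA, {name} = {start}-{end}")
        start = end + 1
    concatenated = {sp: "".join(parts) for sp, parts in blocks.items()}
    return concatenated, partitions
```

The returned partition lines can be written to a text file, one per line, alongside the concatenated alignment.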
Time calibration of the phylogeny was based on four age constraints defined according to a recent timeline of teleost evolution inferred from genome and fossil information, at the most recent common ancestors of the following clades: Eupercaria – around 97.47 MYA (2.5–97.5 inter-percentile range: 91.3–104.0 MYA); the clade combining Eupercaria, Ovalentaria, and Anabantaria – around 101.79 MYA (95.4–109.0 MYA); the clade combining these four groups with Syngnatharia and Pelagiaria – around 104.48 MYA (97.3–112.0 MYA); and the clade combining those six groups with Gobiaria – around 107.08 MYA (100.0–114.0 MYA). All constraints were implemented as lognormal prior distributions with mean values as specified above and a standard deviation between 0.033 and 0.036. Additionally, we constrained the unambiguous monophyly of the groups Notothenioidei, Perciformes, Ovalentaria, Anabantaria, and the clade combining the latter two groups. We performed six replicate BEAST 2 analyses with 330 million MCMC iterations, and convergence among MCMC chains was confirmed by ESS values greater than 120 for all model parameters and greater than 270 for the likelihood and the prior and posterior probabilities. The posterior tree distribution was summarised in the form of a maximum-clade credibility tree with TreeAnnotator v.2.6.0. We attempted to repeat the BEAST 2 analyses with the permissive and full datasets, but these proved too computationally demanding to complete. Nevertheless, the preliminary results from these analyses supported the same tree topology as the analyses with the strict dataset.
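The relationship between the lognormal prior parameters and the quoted 2.5–97.5 inter-percentile ranges can be checked with a short Python sketch. It assumes that the stated mean is the mean of the distribution in real space and that the standard deviation applies in log space; the text does not state this parameterisation explicitly, and the function name is hypothetical.

```python
from math import exp, log
from statistics import NormalDist

def lognormal_interval(mean_real, sigma_log, lower=0.025, upper=0.975):
    """Quantiles of a lognormal distribution given its real-space mean.

    Assumes E[X] = mean_real and sd(log X) = sigma_log, so that the
    log-space mean is mu = log(mean_real) - sigma_log**2 / 2.
    """
    mu = log(mean_real) - sigma_log ** 2 / 2
    z = NormalDist()  # standard normal, used for its quantile function
    return (exp(mu + sigma_log * z.inv_cdf(lower)),
            exp(mu + sigma_log * z.inv_cdf(upper)))

# Mean ages (MYA) of the four constraints listed in the text,
# evaluated at both ends of the stated standard-deviation range.
for mean_age in (97.47, 101.79, 104.48, 107.08):
    for sigma in (0.033, 0.036):
        lo, hi = lognormal_interval(mean_age, sigma)
        print(f"mean {mean_age}, sd {sigma}: {lo:.1f}-{hi:.1f} MYA")
```

Under these assumptions, a mean of 97.47 MYA with a standard deviation of 0.033 gives an interval close to the 91.3–104.0 MYA range quoted for Eupercaria.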