Resolving the early divergence pattern of teleost fish using genome-scale data
Takezaki, Naoko (2021), Resolving the early divergence pattern of teleost fish using genome-scale data, Dryad, Dataset, https://doi.org/10.5061/dryad.v9s4mw6rm
Teleost fishes comprise three primary groups: Osteoglossomorpha (bonytongues and others), Elopomorpha (eels and relatives), and Clupeocephala (the remaining teleost fish). Regarding their phylogenetic relationship, early morphological studies hypothesized the first divergence of Osteoglossomorpha, whereas the recent prevailing view is the first divergence of Elopomorpha. Molecular studies have supported all possible relationships of the three primary groups. This study analyzed genome-scale data from four previous studies: (1) 412 genes from 12 species, (2) 772 genes from 15 species, (3) 1,062 genes from 30 species, and (4) 491 UCE loci from 27 species. The effects of the species, loci, and models used on the constructed tree topologies were investigated. In the analyses of datasets (1)-(3), the first divergence of Clupeocephala, which leaves the other two groups in a sister relationship, was supported by concatenated sequences and gene trees when all the species and genes were used. In contrast, the first divergence of Elopomorpha among the three groups was supported when only species and/or genes with low divergence in sequence and amino-acid frequencies were used. This result corresponded to that of the UCE dataset (4), whose sequence divergence was low and which supported the first divergence of Elopomorpha with high statistical significance. The increase in accuracy of phylogenetic construction obtained by using species and genes with low sequence divergence was predicted by a phylogenetic informativeness approach and confirmed by computer simulation. These results support the view that Elopomorpha was the first group of teleost fish to diverge, consistent with the prevailing view of recent morphological studies.
Sequence Data Used
Amino acid sequence data from three previous studies and nucleotide sequence data from one study were analyzed (Tables 1 and S1). The data from Bian et al. (2016) were provided by the authors. They comprised 418 genes for 12 species: coelacanth (Latimeria chalumnae) and 11 ray-finned fish, including one non-teleost fish [gar (Lepisosteus oculatus)] and 10 teleost fishes: three Osteoglossomorpha [arawana or Asian bonytongue (Scleropages formosus), butterflyfish (Pantodon buchholzi), and knifefish (Papyrocranus afer)], two Elopomorpha [European eel (Anguilla anguilla) and tarpon (Megalops cyprinoides)], and five Clupeocephala [zebrafish (Danio rerio), electric eel (Electrophorus electricus), medaka (Oryzias latipes), fugu (Takifugu rubripes), and stickleback (Gasterosteus aculeatus)]. Six genes whose number of shared amino acid sites was smaller than 50 were excluded. Thus, a set of 412 genes from the 12 species was used for the analyses (Tables 1 and S2).
Data from Chen et al. (2015), Hughes et al. (2018), and Faircloth et al. (2013) were downloaded from the Dryad Digital Repository. The data from Chen et al. (2015) contained amino acid sequences of 14 ray-finned fish: 11 teleost fish, including one Elopomorpha [Japanese eel (Anguilla japonica)], one Osteoglossomorpha [silver arawana (Osteoglossum bicirrhosum)], and nine Clupeocephala species [zebrafish (D. rerio), catfish (Ictalurus punctatus), tetra (Astyanax mexicanus), cod (Gadus morhua), tilapia (Oreochromis niloticus), platyfish (Xiphophorus maculatus), medaka (O. latipes), stickleback (G. aculeatus), and fugu (T. rubripes)], and three non-teleost fish [gar (L. oculatus), sturgeon (Acipenser transmontanus), and bichir (Polypterus senegalus)] (Table S2). The genes that included all 14 ray-finned fish species and the coelacanth (L. chalumnae) were extracted from the total gene set (4,682 genes), and those with fewer than 50 shared amino acid sites were excluded (Total set, 772 genes). From the Total set, three subsets were extracted: genes included in the dataset of Chen et al. (2015) in which teleost species formed a monophyletic cluster (Teleost set, 542 genes), and genes included in their top-1000 and top-500 slowly evolving gene sets (Slow1000 set, 190 genes; Slow500 set, 96 genes). In a preliminary analysis, sets of the top-200 and top-100 slowly evolving genes were also created by choosing the genes with the shortest total branch lengths estimated for the trees of the 15 species. However, the results were essentially the same as those of the Slow1000 and Slow500 sets. Therefore, the Slow1000 and Slow500 sets were used.
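As a rough illustration of how a set of slowly evolving genes can be chosen by total branch length, the following Python sketch ranks genes by the summed branch length of their gene tree and keeps the shortest n. It assumes the total branch lengths have already been estimated elsewhere; the function name and the values are hypothetical and are not taken from the dataset.

```python
def slowest_genes(total_branch_length, n):
    """Return the names of the n genes with the shortest total branch length.

    `total_branch_length` maps a gene name to the sum of branch lengths
    (substitutions per site) of its gene tree.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Sort by branch length, breaking ties by gene name for reproducibility.
    ranked = sorted(
        total_branch_length,
        key=lambda gene: (total_branch_length[gene], gene),
    )
    return ranked[:n]


# Hypothetical total branch lengths, for illustration only.
toy_lengths = {"geneA": 2.4, "geneB": 0.9, "geneC": 1.6, "geneD": 0.5}
print(slowest_genes(toy_lengths, 2))
```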
The Hughes et al. (2018) data contained 1,105 individual genes. The individual genes contained 305 species in total: frog (Xenopus tropicalis), coelacanth (L. chalumnae), lungfish (Protopterus aethiopicus), 10 non-teleost ray-finned fishes [three Polypteriformes, four Acipenseriformes, four Holostei (one Amiiformes, three Lepisosteiformes)], and 292 teleost fishes [seven Elopomorpha, six Osteoglossomorpha, and 279 Clupeocephala species] (Tables S1, S3, and S4). Out of the 1,105 genes, six genes that contained no Osteoglossomorpha sequences were excluded (1,099-gene set) (Tables S3 and S4). Because the focus of this study is to resolve the relationships of Elopomorpha, Osteoglossomorpha, and Clupeocephala, nine Clupeocephala species that had a low proportion of missing data and relatively low divergence were selected: Atlantic herring (Clupea harengus), golden-line barbel (Sinocyclocheilus grahami), red-bellied piranha (Pygocentrus nattereri), northern pike (Esox lucius), grayling (Thymallus thymallus), silver eye (Polymixia japonica), blackbar soldierfish (Myripristis jacobus), yellowfin tuna (Thunnus albacares), and northern snakehead (Channa argus). Three Elopomorpha species (Gymnothorax reevesii, Conger cinereus, and Kaupichthys hyoproroides) and one outgroup (Acipenser naccarii), which appeared in a small number of loci (≤ 171), were excluded (30 species in total and 25.5 ± 4.0 per locus, Table S2). From the 1,099-gene set, loci in which some species had unusually long branches from the common ancestral node of teleost fish (>3 substitutions per site) and loci whose number of sites was smaller than 50 were excluded, leaving 1,062 loci (Tables 1 and S1). This set is referred to as the Hughes data.
Although nucleotide sequence data were available for the Bian data and Hughes data, this study analyzed amino acid sequence data, because synonymous nucleotide sites were likely to be saturated owing to the long time that separates Elopomorpha, Osteoglossomorpha, and Clupeocephala (more than 250 million years, e.g., Near et al. 2012; Hughes et al. 2018). Multiple substitutions that are not correctly identified can generate spurious phylogenetic signals (e.g., Philippe et al. 2005a; Philippe et al. 2011). Using the concatenated nucleotide sequences of the Bian data and Hughes data, branch lengths (the number of substitutions per site) were estimated separately at the third codon positions, where most substitutions are synonymous, and at the first and second codon positions, where most substitutions are nonsynonymous, assuming the tree topology corresponding to Tree 1. Indeed, synonymous substitutions were likely saturated, because the numbers of substitutions per site between Elopomorpha, Osteoglossomorpha, and Clupeocephala and the outgroup at the third codon positions were close to two for both datasets (average 1.75, min. 1.24, and max. 2.47 for the Bian data and average 1.70, min. 1.19, and max. 2.61 for the Hughes data). In contrast, nonsynonymous substitutions were not likely saturated, because the numbers of substitutions per site at the first and second codon positions were much smaller than one (average 0.26 for the Bian data and 0.32 for the Hughes data). However, because there are more possible states in an amino acid sequence (20 states) than in a nucleotide sequence at the first and second codon positions (16 states), the resolution power of amino acid sequences could be higher than that of nucleotide sequences. Therefore, amino acid sequence data were used in this study.
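To give a feel for why about 1.7 substitutions per site suggests saturation while about 0.3 does not, the sketch below uses the Jukes-Cantor (JC69) model, under which the expected proportion of differing nucleotide sites between two sequences separated by d substitutions per site is p = (3/4)(1 - exp(-4d/3)), with a ceiling of 0.75. JC69 is used here only as a simplifying assumption for illustration; it is not stated to be the model used in the study, and the function names are hypothetical.

```python
import math

# Under JC69 the proportion of differing sites can never exceed 3/4.
JC69_CEILING = 0.75


def jc69_expected_difference(d):
    """Expected proportion of differing sites for d substitutions per site."""
    if d < 0:
        raise ValueError("d must be non-negative")
    return JC69_CEILING * (1.0 - math.exp(-4.0 * d / 3.0))


def saturation_fraction(d):
    """Fraction of the JC69 ceiling already reached at distance d."""
    return jc69_expected_difference(d) / JC69_CEILING


# Average branch lengths quoted in the text above.
for label, d in [
    ("Bian, 3rd codon positions", 1.75),
    ("Hughes, 3rd codon positions", 1.70),
    ("Bian, 1st+2nd codon positions", 0.26),
    ("Hughes, 1st+2nd codon positions", 0.32),
]:
    print(f"{label}: d = {d:.2f}, fraction of ceiling = {saturation_fraction(d):.2f}")
```

Under this simplified model, the third codon positions have already reached about 90% of the maximum observable difference, whereas the first and second codon positions are at roughly 30 to 35%, so additional substitutions at third positions add little observable signal.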
The UCE data from Faircloth et al. (2013) contained four outgroups [bichir, lake sturgeon (Acipenser fulvescens), bowfin (Amia calva), and gar], two Elopomorpha [Megalops sp. and slender giant moray (Strophidon sathete)], two Osteoglossomorpha (silver arawana and butterflyfish), and 19 Clupeocephala species (Table S2). Of the 491 UCE loci in the downloaded data, the 278 loci that contained at least one species from each of the four groups (outgroup, Elopomorpha, Osteoglossomorpha, and Clupeocephala) (Table S1) were used for the gene-tree-based approach.
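The locus selection rule for the gene-tree-based approach (at least one species from each of the four groups) can be sketched as follows. The data structures, function names, and toy loci are hypothetical and serve only to illustrate the rule; they are not part of the deposited data.

```python
REQUIRED_GROUPS = {"outgroup", "Elopomorpha", "Osteoglossomorpha", "Clupeocephala"}


def covers_all_groups(species_in_locus, group_of):
    """True if the locus has at least one species from every required group.

    `group_of` maps a species name to its group; species without an
    assigned group are ignored.
    """
    present = {group_of[sp] for sp in species_in_locus if sp in group_of}
    return REQUIRED_GROUPS <= present


def select_loci(loci, group_of):
    """Return the names of loci that cover all four groups, in input order."""
    return [
        name for name, species in loci.items()
        if covers_all_groups(species, group_of)
    ]


# Toy example (hypothetical locus names and species sampling).
group_of = {
    "gar": "outgroup",
    "Megalops sp.": "Elopomorpha",
    "silver arawana": "Osteoglossomorpha",
    "zebrafish": "Clupeocephala",
    "medaka": "Clupeocephala",
}
loci = {
    "uce-1": {"gar", "Megalops sp.", "silver arawana", "zebrafish"},
    "uce-2": {"gar", "zebrafish", "medaka"},  # lacks two of the groups
}
print(select_loci(loci, group_of))
```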
Japan Society for the Promotion of Science, Award: 15K08187