Skip to main content

Myoglobin primary structure reveals multiple convergent transitions to semi-aquatic life in the world's smallest mammalian divers

Cite this dataset

He, Kai et al. (2021). Myoglobin primary structure reveals multiple convergent transitions to semi-aquatic life in the world's smallest mammalian divers [Dataset]. Dryad.


Identifying the phylogenomic underpinnings of specialized phenotypes that fueled evolutionary transitions into new adaptive zones is central to evolutionary biology. The order Eulipotyphla (e.g., moles, shrews, and hedgehogs) is ideally suited to address this question as semi-fossorial, fossorial, and semi-aquatic forms have repeatedly arisen from terrestrial forbearers. However, our understanding of the ecomorphological pathways leading to these diverse lifestyles has been confounded by a fragmentary fossil record and potential morphological convergence. The net surface charge of myoglobin (ZMb) is readily determined from its primary structure and provides an objective target to map ancient evolutionary transitions due to mechanistic linkages of ZMb with myoglobin concentration. Myoglobin facilitates O2 storage and transport in muscle and its concentration is sharply elevated in breath-hold divers relative to terrestrial mammals, with fossorial and high-elevation species only showing minor increases. Here we trace the evolution of ZMb to unravel the history of lifestyle transitions in the clade containing the world’s smallest endothermic divers. We first constructed a comprehensive phylogeny that resolved previously intractable intra-family relationships, and confirmed that ZMb accurately predicts aquatic habits within Eulipotyphla. Ancestral reconstructions of ZMb, which included representatives from all seven recognized semi-aquatic genera, provide key insights into the timing and mode of adaptations that underpin the evolution of the diverse ecomorphotypes within Eulipotyphla, and unambiguously revealed that semi-aquatic lifestyles evolved twice in moles, and three times in shrews. Our phylogenetically informed analysis supports ZMb as an effective tool to trace ancient secondary aquatic transitions of mammals based on protein sequence alone.


Our taxon sampling of eulipotyphlan mammals included 44 talpids, 11 shrews, five erinaceids, and one solenodon (61 specimens encompassing 60 species). Note that this sampling incorporates talpid specimens from five putative ‘cryptic lineages’ (denoted by ‘sp.’, ‘sp. 1’, or ‘sp. 2’ in the figures and tables); for the purpose of this study, each of these genetically distinct lineages are considered independent species. The tissue samples were from various resources (supplementary table S1), with most tissue samples provided by co-authors from China, Japan, Canada, and the USA. Voucher specimens collected by co-authors were deposited in the Kunming Institute of Zoology (KIZ, China), the National Museum of Nature and Science (NMNS, Japan) or kept in personal collections (A.S. and S.I.K.). Additional tissue samples were obtained with permission from the National Museum of Natural History (USNM, USA), the Burke Museum of Natural History and Culture (NWBM, USA), the Field Museum of Natural History (FMNH, USA), and the New Mexico Museum of Natural History (NMMNH, USA).

For each specimen, we used a capture hybridization approach (Mason, et al. 2011; Horn 2012) to enrich segments of 25 mammalian tree-of-life genes (Meredith, et al. 2011) for phylogenetic analyses. We first downloaded tree-of-life sequences from three eulipotyphlan whole genome sequences available in GenBank (Erinaceus europaeus, Sorex araneus, Condylura cristata), together with 60 bp of 5’- and 3’- flanking sequence for each target. We then aligned each gene segment using MAFFT (Katoh and Standley 2013). The resulting alignments were used to design 120 mer RNA probes (baits) that overlapped by 90 bp (4x tiling), and collapsed any replicates with up to six mismatches (95% similarity) for each segment. For example, if the 120-bp gene fragments from all species were 95% similar with each other, only one probe was designed for this region, otherwise two or more probes were designed to cover the heterogeneity. The probes were synthesized by Arbor Biosciences (Ann Arbor, MI, USA). As a first step in DNA library construction we extracted total DNA from each specimen using a Qiagen DNeasy Blood & Tissue Kit (Qiagen, Canada). The quality and quantity of each DNA sample was measured using a Nanodrop 2000. We then sheared the total DNA into smaller fragments using NEBNext dsDNA Fragmentase (New England Biolabs, Canada), and used this as template to construct DNA libraries using a NEBNext Fast DNA Library Prep Set for Ion Torrent kit (New England Biolabs, Canada). Each sample library contained a unique barcode adapter (NEXTflex DNA Barcodes for Ion Torrent, BIOO Scientific, USA). We selected libraries within the size range of 450-500 bp using a 2% E-gel on an E-Gel Electrophoresis System (Invitrogen, Canada), and re-amplified the size-selected libraries using a NEBNext High-Fidelity 2X PCR Master Mix (New England Biolabs, Canada). Finally, we purified the libraries using Serapure magnetic beads, and measured DNA concentrations using a Qubit 2 Fluorometer (Thermo Fisher Scientific, Canada).

We pooled up to four DNA libraries of similar quality and concentrations before hybridization to avoid biased target captures (e.g., baits being used up by one sample). Approximately 500 ng (100 ng to 1000 ng) pooled DNA library was used for each hybridization. We conducted in-solution hybridization using a myBaits® custom target capture kit (Arbor Biosciences, Ann Arbor, MI, USA) following the MYbaits user manual v3.0. The enriched libraries were re-amplified and purified as above. We thereafter measured the DNA concentration using a Qubit flourometer and pooled the enriched libraries for sequencing. The libraries were sequenced using either v318 chips on an Ion Torrent Personal Genome Machine (PGM) or an Ion PI Chip v3 via an Ion Proton Machine.

Ion Torrent sequencing technology is characterized by higher error rates than Illumina (Jünemann, et al. 2013), and Ion Torrent platforms produce single-end (rather than pair-end) reads. We therefore conducted comprehensive data cleaning and reconciliation procedures, and selected software which could handle single-end sequencing data. The raw data were automatically demultiplexed, trimmed, and converted to FASTQ format on the Torrent Suite v4.0.2 (Thermo Fisher Scientific, Canada) after sequencing. Briefly, we trimmed contaminant (adapters and barcodes) sequences with AlienTrimmer (Criscuolo and Brisse 2013) using conservative parameters (-k 15 -m 5 -l 15 -q 0 -p 0). To remove poor quality data, we used the DynamicTrim function of the software SolexaQA++ v3.1 (Cox, et al. 2010) to trim sequences dynamically and crop the longest contiguous segment for each read. We set the probability value to 0.01 (i.e., one base call error every 100 nucleotides) in this analysis. We removed duplicated and near-duplicated reads for each sample as implemented in ParDRe using all default parameters (Gonzalez-Dominguez and Schmidt 2016). Finally, we conducted data correction using Karect, a multiple sequence alignment-based approach (Allam, et al. 2015), because this software handles substitution, insertion, and deletion errors. The output files of Karect were used for sequence assembly.

We de novo assembled the raw sequences for each sample using Abyss v2.0 (Simpson, et al. 2009), MRIA v4.0 (Chevreux, et al. 2004), and SPAdes v3.10 (Bankevich, et al. 2012), all of which were designed for short read sequencing data. Abyss is able to use a paired de Bruijn graph instead of a standard de Bruijn graph by specifying a k-mer size (K) and a k-mer pair span (k). We set the K and k to 17 and 33, respectively, and set the maximum number of branches of a bubble to 5 in our analyses. MIRA is based on a Smith-Waterman algorithm. We ran MIRA using specific parameters including bases_per_hash = 31 and minimum_read_length = 35. The SPAdes assembler is also based on a de Bruijn graph, and we set only one k-mer value of 33 for analyses. It is known that merging different draft assemblies (i.e., reconciliation) could improve the assembly quality (Zimin, et al. 2008). We therefore conducted reconciliation using Geneious R11 ( We concatenated the assembled draft contigs generated in three assemblers into a list. We removed contigs shorter than 120 bps, and used the BBMap dedupe function to remove duplicate contigs. We conducted assemblies using the Geneious assembler to group draft contigs with a minimum overlap identity of 96% to a new contig. Finally, all the new contigs and the leftover draft contigs were grouped into a contig list for subsequent analyses.


National Natural Science Foundation of China, Award: 31970389

National Natural Science Foundation of China, Award: DEB-1457735

National Natural Science Foundation of China, Award: RGPIN/238838-2011