Supporting data for: The genome of the relict earless monitor lizard, Lanthanotus borneensis, and the Toxicofera hypothesis
Data files
Apr 01, 2026 version files 14.33 GB
- LborII_TBG1049_anno.gff.gz (314.53 MB)
- LborII_TBG1049_cds.fasta.gz (6.87 MB)
- LborII_TBG1049_proteins.fasta.gz (4.51 MB)
- LborII_TBG1049.fasta.gz (468.54 MB)
- LborII_TBG1049.vcf.gz (13.53 GB)
- README.md (3.79 KB)
- SCOS_Genetrees.tar.gz (316.31 KB)
- SCOS_Sequences.tar.gz (3.23 MB)
- SUPERMATRIX.treefile (1.17 KB)
- SUPERTREE.nwk (950 B)
Abstract
The Earless Monitor Lizard, Lanthanotus borneensis, is a unique living fossil restricted to the island of Borneo and a possible key to understanding the evolution of the venom delivery system and secondary adaptation to water in lizards and snakes (Squamata). We have sequenced and de novo assembled the genome of L. borneensis to a total size of 1.5 Gbp in 975 contigs, with an N50 of 52 Mbp and an L50 of 9. The genome completeness is estimated to be 93% based on the Sauropsida OrthoDB core gene set. A genome-wide set of Lepidosauria orthologs was compiled to reconstruct and date their phylogeny, resulting in 966 protein-coding sequences amounting to a concatenated alignment of 356 kbp with 188 kbp parsimony-informative sites. Based on this phylogenomic analysis, one of the largest of its kind yet conducted for Squamata, we found that a Toxicofera clade (comprising Serpentes, Anguimorpha, and Iguania) is supported by a plurality of gene trees, but critically, support for relationships within Toxicofera is almost equally distributed amongst the three possible topologies. Our tree-dating confirms a rapid divergence of all major squamate clades within the first 10% of squamate history, which may have contributed to rampant incomplete lineage sorting. While we did not identify positive selection on genes associated with venom components at the base of Toxicofera, our analyses find strong positive selection on the giant protein titin throughout the main clades of Toxicofera and especially in snakes. Genome-wide heterozygosity is low (HO = 0.0004), as is the effective population size towards the present. Future studies of the evolution of the venom delivery system in Toxicofera require a “true” species tree but also individual gene trees due to incomplete lineage sorting and the concomitant potential for hemiplasy.
While we found no support for positive selection on venom-related genes at the origin of Toxicofera, titin – a key component of striated muscle elasticity – emerges as a target for future evolutionary studies in Toxicofera and especially in wide-gaped snakes (“Macrostomata”). The low observed genome-wide heterozygosity and the low but stable effective population size of L. borneensis during the large-scale habitat fluctuations on Sundaland in the Quaternary suggest an unexpected resilience to environmental perturbations but also a potentially lowered adaptive potential of this isolated lineage.
Dataset DOI: 10.5061/dryad.stqjq2cg9
Description of the data and file structure
Summary:
This dataset contains a de novo genome assembly and a detailed annotation of the Earless Monitor Lizard (Lanthanotus borneensis). The genome was assembled from PacBio CLR reads with the software Miniasm.
The annotation was generated with the MAKER pipeline on a version of the genome in which repetitive regions were soft-masked.
Furthermore, this dataset includes a collection of high-quality orthologous sequences identified across Lepidosauria genomes using the GEMOMA-to-Phylogeny pipeline and publicly available homology data. These orthologs are provided as amino acid FASTA sequences together with a collection of gene trees constructed with IQ-TREE using a maximum likelihood (ML) approach with 1000 bootstrap replicates. Final species trees were constructed both from a concatenated matrix with an ML approach and with a consensus approach using ASTRAL-III.
Files and variables
LborII_TBG1049.fasta.gz: A zipped fasta file containing the original de novo assembly, without the changes made by NCBI, as used for all downstream analyses. Use gunzip LborII_TBG1049.fasta.gz to extract the original file.
LborII_TBG1049_proteins.fasta.gz: A zipped fasta file containing the amino acid sequences annotated using the MAKER pipeline. Use gunzip LborII_TBG1049_proteins.fasta.gz to extract the original file.
LborII_TBG1049_cds.fasta.gz: A zipped fasta file containing the coding sequences annotated using the MAKER pipeline. Use gunzip LborII_TBG1049_cds.fasta.gz to extract the original file.
LborII_TBG1049_anno.gff.gz: A zipped gff file containing the annotated gene models constructed with the MAKER pipeline. Use gunzip LborII_TBG1049_anno.gff.gz to extract the original file.
LborII_TBG1049.vcf.gz: A zipped vcf file containing the called and filtered genotypes for the LborII genome using mapped Illumina short reads and the BCFtools collection. Use gunzip LborII_TBG1049.vcf.gz to extract the original file.
SCOS_Sequences.tar.gz: A compressed tar-ball containing 966 ortholog sequences in fasta format that were identified using the GEMOMA-to-Phylogeny pipeline (https://github.com/mag-wolf/GEMOMA-to-Phylogeny) and 22 other Lepidosauria genomes and annotations. Use tar -xzvf SCOS_Sequences.tar.gz SCOS_Sequences to extract the tarball.
SCOS_Genetrees.tar.gz: A compressed tar-ball containing 966 gene trees in newick format, one for each ortholog identified using the GEMOMA-to-Phylogeny pipeline (https://github.com/mag-wolf/GEMOMA-to-Phylogeny) and 22 other Lepidosauria genomes and annotations. Use tar -xzvf SCOS_Genetrees.tar.gz SCOS_Genetrees to extract the tarball.
SUPERMATRIX.treefile: The final species tree in newick format using a concatenated matrix of all ortholog sequences constructed with an ML approach using IQTree. You may open this file in e.g. iTOL (https://itol.embl.de).
SUPERTREE.nwk: The final species tree in newick format using a consensus approach of all constructed gene trees using ASTRAL-III. You may open this file in e.g. iTOL (https://itol.embl.de).
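The VCF file is large, so it can be convenient to stream it rather than extract it first. The following is a minimal sketch of how an observed-heterozygosity ratio could be tallied from the genotype column. It assumes a single-sample, diploid VCF with a GT field; the function names are hypothetical, and the denominator used here (all sites with a called genotype in the file) may differ from the one used for the published estimate, so the result is not expected to reproduce it exactly.

```python
import gzip


def observed_heterozygosity(vcf_lines):
    """Return (het_sites, called_sites, ratio) for the first sample of a VCF.

    Assumption: diploid genotypes in the GT field; sites with missing or
    non-diploid genotypes are skipped.
    """
    het = called = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column present
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        genotype = fields[9].split(":")[format_keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # missing or non-diploid call
        called += 1
        if alleles[0] != alleles[1]:
            het += 1
    ratio = het / called if called else float("nan")
    return het, called, ratio


def heterozygosity_from_file(path):
    """Stream a gzip-compressed VCF without extracting it to disk."""
    with gzip.open(path, "rt") as handle:
        return observed_heterozygosity(handle)
```

For example, heterozygosity_from_file("LborII_TBG1049.vcf.gz") would read the file line by line and keep memory use low.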
Access information
Other publicly accessible locations of the data:
- All sequencing data were uploaded to NCBI under BioProject PRJNA1313344 and BioSample SAMN50892088. Raw reads can be accessed via the SRA repositories SRR35210199 (PacBio CLR) and SRR35210198 (Illumina), while the reference genome is assigned to the genome ID XXXXXXXXX (Submission number: SUB15584084).
Genome construction
A de novo assembly was constructed using the PacBio CLR long reads and Illumina paired-end short reads. Initial assembly was conducted with Miniasm v0.3_r179 (Li, 2016), following read overlap detection using Minimap2 v2.14 (Li, 2018) with the “map-pb” preset. The resulting assembly was polished with two rounds of Racon v1.3.1 (Vaser et al., 2017), each preceded by realignment of the PacBio reads to the intermediate assembly with Minimap2. Subsequently, two rounds of polishing were performed with Pilon v1.23 (Walker et al., 2014), using the Illumina paired-end reads. Reads were aligned to the intermediate assembly using BWA-MEM v0.7.17 (Li, 2013), and alignments were sorted and indexed using Samtools v1.9 (Li et al., 2009). Pilon was run with default parameters, with the alignments supplied via the “--frags” option.
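Contiguity statistics such as N50 and L50 can be recomputed directly from the assembly FASTA. The sketch below is one way to do this, assuming the common definition (N50 is the length of the shortest contig in the smallest set of longest contigs that covers at least half of the assembly, and L50 is the number of contigs in that set). The function names are hypothetical, and this is not necessarily the tool used to produce the published values.

```python
import gzip


def read_contig_lengths(path):
    """Yield the length of each sequence in a FASTA file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    length = None
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length  # finish the previous record
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length  # last record in the file


def n50_l50(lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            return length, rank
    raise ValueError("no sequences supplied")
```

A call such as n50_l50(read_contig_lengths("LborII_TBG1049.fasta.gz")) would read the compressed assembly without extracting it.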
To annotate repetitive elements, the assembly was first soft-masked using RepeatMasker v4.1.5 (http://www.repeatmasker.org) with the built-in Anopheles repeat library (--species anopheles, -xsmall). The masked genome was subsequently used to construct a RepeatModeler database with RepeatModeler v1.0.11 (http://www.repeatmasker.org/RepeatModeler/) using default parameters to identify novel repeat families. The resulting consensus library was then used in a second round of RepeatMasker to annotate repeats comprehensively. All RepeatMasker runs included GC content calculation (-gccalc).
Gene annotation was performed using MAKER v2.31.8 (Holt et al., 2011). As input, a soft-masked version of the final genome assembly was provided. Protein homology evidence was supplied in the form of RefSeq protein annotations obtained from NCBI for the following reptile genome assemblies: Anolis carolinensis (AnoCar2.0, GCF_000090745.2), Pogona vitticeps (pvi1.1, GCF_900067755.1), Python bivittatus (Python_molurus_bivittatus-5.0.2, GCF_000186305.1), and Ophiophagus hannah (OphHan1.0, GCA_000516915.1). Repeat masking employed the built-in simple model (model_org=simple) along with the default transposable element (TE) protein database distributed with MAKER (repeat_protein=te_proteins.fasta). Ab initio gene prediction was conducted using Augustus v2.5.5, configured with the Homo sapiens gene models. tRNA prediction was disabled (trna=0), and ab initio prediction was restricted to unmasked regions only (unmask=0). General MAKER settings included an increased maximum DNA chunk size of 100,000 bp (max_dna_len=100000) and an expanded expected maximum intron size of 10,000 bp (split_hit=10000), optimizing the annotation process for large eukaryotic contigs.
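Soft-masking with the -xsmall option writes repetitive regions in lowercase instead of replacing them with N. As a rough illustration, the fraction of soft-masked bases in a soft-masked FASTA could be tallied as follows. This is a sketch only: the function name is hypothetical, N bases are excluded from the denominator as an assumption, and it applies only to a file that actually carries the lowercase masking.

```python
import gzip


def softmasked_fraction(fasta_lines):
    """Return the fraction of non-N bases that are lowercase (soft-masked)."""
    masked = total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        n_upper = seq.count("N")
        n_lower = seq.count("n")
        lowercase = sum(1 for base in seq if base.islower())
        total += len(seq) - n_upper - n_lower
        masked += lowercase - n_lower  # do not count 'n' as masked sequence
    return masked / total if total else 0.0


def softmasked_fraction_from_file(path):
    """Stream a gzip-compressed FASTA and report its soft-masked fraction."""
    with gzip.open(path, "rt") as handle:
        return softmasked_fraction(handle)
```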
Phylogenomics
Phylogenomic reconstructions were performed using the GeMoMa-to-Phylogeny wrapper, which is presented in further detail in Wolf et al. 2022. Publicly available genome assemblies as well as the genome assembled here were annotated from homology information using the GeMoMa v1.7.1 pipeline (Keilwagen et al. 2016). The resulting annotations were used for ortholog calling with OrthoFinder v2.5.2 (Emms and Kelly 2019), and we extracted single-copy orthologous sequences (SCOS) for which no more than 25% of species were missing. Orthologous sequences were aligned with MAFFT v7.475 applying 1000 iterative refinements. Alignments were trimmed using ClipKIT v1.1.3 in “kpic-smart-gap” mode to allow for an additional smart-gap-based trimming. Based on the trimmed alignments, gene trees were constructed with IQ-TREE v2.1.2 (Minh et al. 2020) with 1000 bootstrap replicates each. We further filtered gene trees and alignments based on the maximum likelihood genetic distance calculated by IQ-TREE: orthologs below the 5% quantile or above the 95% quantile were removed, both to avoid taking misalignments into account and to exclude sequences with too little information for meaningful tree construction. Subsequently, all alignments were concatenated using FASconCAT-G v1.04 (Kück et al. 2014), and an overall tree was inferred with IQ-TREE, again with 1000 bootstrap replicates. Additionally, ASTRAL-III v5.7.3 (Zhang et al. 2018) was used to create a consensus tree from all individual gene trees; it also calculated quartet scores to assess the amount of genetic conflict within the dataset.
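The two filtering steps described above (a cap on missing species and a quantile cut on genetic distance) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the function names are hypothetical, the per-gene distances are assumed to be already collected into a dictionary, and the quantile definition (the "inclusive" method of Python's statistics module) is an arbitrary choice that may differ from the one used in the original analysis.

```python
from statistics import quantiles


def passes_occupancy(n_present, n_total, max_missing=0.25):
    """True if the fraction of missing species does not exceed max_missing."""
    return (n_total - n_present) / n_total <= max_missing


def filter_by_distance(distances, lower_pct=5, upper_pct=95):
    """Keep genes whose distance lies within the given percentile range.

    distances: dict mapping a gene identifier to its ML genetic distance.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = quantiles(distances.values(), n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return {gene: d for gene, d in distances.items() if low <= d <= high}
```

With 23 species in total (an assumption based on the 22 other genomes mentioned in the file descriptions plus L. borneensis), an ortholog would need at least 18 species present to pass the occupancy check.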
