Data from: Coevolution of Drosophila-type timeless with partner clock proteins
Data files
Mar 10, 2025 version files 2.35 MB
-
BRWD3-333seq.fasta
549.52 KB
-
CRYs__Phr6-4-710seq.fasta
468.04 KB
-
dTIM-454seq.fasta
566.53 KB
-
JETLAG_245-seq.fasta
99.61 KB
-
mTIM_TOF_protein_112_seq.fasta
149.17 KB
-
PER-390seq.fasta
508.10 KB
-
README.md
4.34 KB
Abstract
Drosophila-type timeless (dTIM) is an established key clock protein in fruit flies, regulating rhythmicity and light-mediated entrainment. However, as indicated by functional experiments, its contribution to the clock differs in various insects. Therefore, we conducted a comprehensive phylogenetic analysis of dTIM across animals, dated its origin, gene duplications, and losses. We identified variable and conserved protein domains and pinpointed animal lineages that underwent the biggest changes in the dTIM sequence. While dTIM modifications are only mildly affected by changes in the PER protein, even the complete loss of PER in echinoderms had no impact on dTIM. However, changes in dTIM always co-occur with the loss of CRYPTOCHROMES or JETLAG. This is exemplified by the remarkably accelerated evolution of dTIM in phylloxera and aphids. Finally, alternative d-tim splicing, characteristic of D. melanogaster temperature-dependent function, is conserved at least to some extent in Diptera, albeit with unique alterations. Altogether, this study pinpoints major changes that shaped dTIM origin and evolution.
https://doi.org/10.5061/dryad.44j0zpcq0
Description of the data and file structure
Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
mTIM TOF protein 112 seq.fasta
- Document with 112 protein sequences of the mammalian-type TIMELESS (mTIM) and TOF proteins used in the phylogeentic analysis of dTIM origin
PER-390seq
- Document with 390 sequences of PER proteins used in the analysis of PER substitution rate
dTIM-454seq
- Document with 454 sequences of the Drosophila-type TIMELESS (dTIM) proteins
CRYs, Phr6-4-710seq
- Document with 710 sequences of the Drosophila-type CRY protein, mammalian-type CRY protein, and Phr6-4 photolyases
JETLAG 245-seq
- Document with 245 sequences of the JETLAG proteins
BRWD3-333seq
- Document with 333 sequences of the BRWD3 (Ramshackle) proteins
Files and variables
File: mTIM_TOF_protein_112_seq.fasta
Description: Document with 112 protein sequences of the mammalian-type TIMELESS (mTIM) and TOF proteins used in the phylogeentic analysis of dTIM origin. Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
File: PER-390seq.fasta
Description: Document with 390 sequences of PER proteins used in the analysis of PER substitution rate. Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
File: dTIM-454seq.fasta
Description: Document with 454 sequences of the Drosophila-type TIMELESS (dTIM) proteins. Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
File: CRYs, Phr6-4-710seq.fasta
Description: Document with 710 sequences of CRYPTOCHROME and 6_4Photolyase proteins. Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
File: JETLAG 245-seq.fasta
Description: Document with 245 sequences of JETLAG (FBXL15) proteins. Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
File: BRWD3-333seq.fasta
Description: Document with 333 sequences of BRWD3 (RAMSHACKLE) proteins. Dataset contains sequences in the fasta format. The sequence title (after >) contain species name, higher taxon name (order and similar), accession number, and additional attributes (such as abbreviation TSA for protein seq translated from Transcriptome Sequence Assembly cDNA sequence). The data are amino acid sequences of the protein wrapped in 80-character line format.
Gene Identification and Data Sets
A systematic search for clock components was conducted, building on previous studies exploring the evolution, duplication, and loss of circadian clock components (Thakkar et al., 2022; Kotwica-Rolinska et al., 2022a). To identify circadian clock proteins and genes encoding dTIM, mTIM, TOF1, PERIOD, dCRY, mCRY, 6-4 Photolyase, JETLAG, and BRWD3 in Bilateria/Metazoa, GenBank (NCBI) protein and genomic databases, as well as transcriptome shotgun assemblies (TSA), were utilized. BLASTP and tBLASTn algorithms were applied with taxon-restricted searches targeting specific lineages at the levels of orders, suborders, infraorders, and, in some cases, families. In certain instances, annotated genomes or whole-genome shotgun contigs (wgs) were also examined. Protein sequences of circadian clock genes from Drosophila melanogaster, Danaus plexippus, and Pyrrhocoris apterus served as initial queries. Reciprocal and lineage-focused searches incorporated queries representing identified proteins from related taxa. To detect duplicate hits (common in TSAs) or closely related proteins (e.g., bHLH-PAS proteins instead of PERIOD), the E-INS-i algorithm in MAFFT was used for alignment, followed by FastTree analysis (Price et al., 2009), both conducted in Geneious Prime 21.0.3 (Biomatters, New Zealand).
Phylogenetic analyses
To identify specific types of TIM, CRY, FBXL, or PAS proteins (e.g., distinguishing PER from bHLH PAS proteins), sequences were aligned using the MAFFT algorithm in Geneious Prime 21.0.3 (Biomatters, New Zealand). Representative datasets containing target proteins and related types were included. Ambiguously aligned regions were trimmed, and phylogenetic analyses were conducted using RAxML with a maximum likelihood GAMMA-based model in Geneious Prime 21.0.3.
The Metazoan phylogenies presented in Figures 2, 4, 5, 6, 7, 8 and S9 were retrieved using TIMETREE 5 (Kumar et al., 2022) and cross-referenced with recent molecular phylogenomic studies. These included works focused on insects (Misof et al., 2014), chelicerates, and crustaceans (von Reumont et al., 2012; Thomas et al., 2020; Bernot et al., 2023). For insects, phylogenomic studies specific to Polyneoptera (Wipfler et al., 2019), the hemipteroid assembly (Johnson et al., 2018), and Coleoptera (McKenna et al., 2019) were used to refine the corresponding sections of the phylogeny.
Gene Loss
While it is impossible to definitively prove the absence of a gene, in some cases, gene loss is the most plausible explanation. Recent advances in phylogenomics, the availability of extensive TSA data, and an increasing number of sequenced genomes have enabled a systematic exploration of circadian clock genes across major Bilateria groups (Protostomia and Deuterostomia). Our analysis focuses on lineage-specific gene losses that are strongly supported by data from multiple species, whole-genome assemblies, and deep transcriptome sequencing. Evidence for gene loss is summarized for specific genes and animal groups in Supplementary Table 1.
Finally, the CTT-like motif, located on the ARM2 domain, and CTT motif located on dCRY, were annotated on D. melanogaster dTIM and dCRY sequences following the boundaries defined by Lin et al. (2023). After MAFFT alignment with the other dTIM sequences in the dataset, it was considered conserved when the degree of similarity exceeded 50%.
Prediction of Protein Domains
dTIM functional and binding domains were originally annotated based on the sequence of Drosophila melanogaster dTIM, following the boundaries defined by Lin et al., (2023) or identified in cell-based experiments (Saez and Young, 1996). Accordingly, the dCRY-interaction domains, ARMADILLO repeats (ARM1 and ARM2), and PER-binding sites 1 (PER-bind #1) and 2 (PER-bind #2) were annotated in the D. melanogaster dTIM protein isoform P (NP_001334730).
This annotated sequence was then aligned to each sequence in the dataset using MAFFT (Katoh and Stanley, 2013, v7.490, Algorithm: E-INS-I, scoring matrix: BLOSUM80) within the Geneious Prime 2024 software. The obtained similarities were plotted as intensities corresponding to numerical values in Figures 2 and 3. The PER-bind #1 domain was annotated when the total length of the region was at least 35 amino acids and the degree of similarity exceeded 40%. The CRY-interaction domain was highlighted when similarities exceeded 20%.
In the gene models (Figures 6 and 7), only two shades were used: regions corresponding to ARM1, ARM2, and PER-binding sites were highlighted when similarity exceeded 40%. For Limulus, a value of 38% was depicted as "low similarity" using a paler shade of red. The CRY-interaction domain was highlighted when similarity exceeded 20%. For Bemisia, "low similarity" was represented by a paler shade of blue, highlighting values of 19%. In the supplementary figures depicting protein models (Figures S2, S4, S6), specific numerical values were presented next to domain annotations when a single shade was used for each domain.
Nuclear localization signal (NLS) domains were predicted for each protein sequence using PSORT II and NLStradamus prediction software. Acidic domains were annotated based on the following criteria: a sequence located in the corresponding protein region that is at least half the size of the reference sequence, contains at least 20% acidic residues (D, E), and includes less than 10% basic residues (K, H, R). To further investigate the acidic domains, their sequence motifs were scanned. Conserved acidic motifs, presented in Figure S5, were identified using Gapped Local Alignment of Motifs (GLAM2) within the MEME Suite. This analysis focused on two distinct regions of 35 insect dTIM proteins: the sequences between ARM1 and ARM2 and the region covering the C-terminal tails.
The length of the entire protein, ARM1-ARM2 region, and C-tail length
The protein sequence was considered (likely) complete when the N-terminal part included a complete ARM1 domain starting with a methionine and the ARM2 domain was present. If a TSA (transcriptome shotgun assembly) sequence was used, the stop codon indicated the predicted C-terminal end. In the case of multiple paralogs in Daphnia, only protein sequences longer than 600 amino acids (aa) and containing both ARM domains were further analyzed.
To determine the length of the variable region (in amino acids) in the central part of the protein, all protein sequences in the dataset were aligned using MAFFT (Katoh and Stanley, 2013, v7.490, Algorithm: E-INS-I, scoring matrix: BLOSUM80). Conserved motifs corresponding to the Drosophila melanogaster ARM1 and ARM2 domains were identified and the number of aa separating ARM1 and ARM2 calculated. Additionally, conserved motifs corresponding to Drosophila YKDQ (located in ARM1) and LLLR (located in PER-bind #2 and ARM2) were identified in each dTIM protein as a parallel measurement.
To measure the length of the C-terminal tail, a conserved motif corresponding to Drosophila DLIE (located at the C-terminal end of ARM2) was identified in each dTIM. The number of amino acids between the DLIE-like motif and the C-terminus was then calculated and plotted (Figure 2E). For motif positions, see Figure S2. These values were plotted as dots representing each sequence distance in PRISM 7 for all proteins in the dataset, with exact values provided in Table S2. If a species lacked one or more conserved motifs (due to partial sequences or deletions), the length of the corresponding region was not calculated and was annotated as “n.d.” (non-determined) in Supplementary Table S2.
Substitutions per Amino Acid per Million Years
To calculate substitution rates per amino acid position for dTIM and PER proteins, we used the following approach. Protein sequences (dTIM or PER) were aligned using the E-INS-i algorithm in MAFFT (Katoh and Standley 2013). The complete alignments were used to infer phylogenetic trees with RAxML (Stamatakis 2014) under the PROTGAMMAJTT model, ensuring the topology matched the evolutionary relationships of the organisms. Constraint trees were created in TreeGraph 2 (Stöver and Müller 2010). For Crustacea, where phylogenies are still debated, we enforced monophyly with insects but did not specify internal branching. Similarly, Polyneoptera were constrained as monophyletic without defining their internal topologies.
The resulting unrooted trees were swapped to position Protostomia and Deuterostomia as sister groups. Branch lengths were extracted and summed from the Protostomia/Deuterostomia split to terminal species. These values were then divided by 700 million years (the estimated divergence time of Protostomia and Deuterostomia) to compute substitution rates per amino acid position per million years. The calculated rates are presented in Supplementary Tables 3 and 4. These values were plotted in Prism 7 (GraphPad Software, La Jolla, CA, USA), with each dot on the plot representing a single protein.
Gene Models
In brief, similarities in gene structure were assessed as previously (Smykal et al., 2020) when the gene models were either directly downloaded from GenBank species’ Whole-genome shotgun contigs (wgs) or Representative genomes (RefSeq genomes). Alternatively, gene models were manually reconstructed using genomic and transcriptomic data from GenBank, following two distinct scenarios: (1) Long genomic contigs without annotated timeless genes. The wgs contigs were manually annotated by mapping Transcriptome Shotgun Assembly (TSA) and/or non-redundant nucleotide (nr/nt) d-tim sequences by the Minimap2 mapper (Li 2021) within Geneious Prime 2024 software (Biomatters, Auckland, New Zealand). Lethocerus indicus tim gene model was reconstructed by mapping TSA sequences from three related species: Trichocorixa calva, Gelastocoris ocwulatus, Buenoa margaritacea, to an unannotated genomic scaffold. Similarly, Neoneuromus ignobilis tim gene was annotated using TSA from Protohermes xanthodes. In Pogonus chalceus, two partial non-overlapping TSA sequences cover the majority of the coding sequence (CDS). The missing region spanning parts of exons 7 and 8 was reconstructed by mapping TSAs from related beetle species Amphizoa insolens, Tribolium castaneum, and Sinaspidytes wrasei. (2) Fragmented genomic contigs without annotated timeless genes. The tim exons were annotated and contigs scaffolded using tim CDS or TSA sequences. In the case of Lepisma saccharina, Thermobia domestica d-tim TSAs were mapped to the seven L. saccharina unannotated contigs. See Table S8 for the accession number of genomic and TSA sequences used for gene reconstruction and Table S9 for accession numbers used to annotate the CDS in gene models.
Exon Homology, Similarity, and dTIM Domain Localization in Gene Models
The tim CDSs of representative species were translated and aligned using MAFFT (Katoh and Standley 2013) within Geneious Prime 2024 software to determine conserved domains and conserved/homologous exons and exon boundaries. Two protein alignments were used: “species-to-species pair alignments” and “all representative proteins alignment”. The longest d-tim isoforms of 29 representative species were analyzed in Figure 6. Available d-tim isoforms were included in Figure 7, where 28 TIM sequences from 11 dipteran species were aligned.
Two exon boundaries were determined as conserved through the species-to-species dTIM/dTIM pair alignment if a minimum of three out of five amino acids (aa) located at the exon/intron boundary were identical. The similarity of amino acids was analyzed within: One pair of homologous exons if at least one side of the exon border could be aligned, or several exons homologous to one ‘fused’ exon if both sides of the exon border could be aligned, according to species-to-species pair exon alignment using MAFFT (% Similarity, auto algorithm, scoring matrix: BLOSUM90 with threshold 1). A semi-quantitative scale with five similarity ranges (30-40; 41-50, 51-60, 61-70; 71-100 %) was used to visualize exon/exon similarity as quadrilaterals of five different levels of greyscale in Figures 6 and 7.
CRYPTOCHROME (CRY)-interaction domains and PERIOD (PER)-binding sites were mapped and annotated in dTIM protein representative sequences utilizing domains defined in Drosophila melanogaster dTIM (Lin et al., 2023; Wulbeck et al., 2005). The figure was drawn using Adobe Illustrator (version 6) software.
Gene Syntenies
In brief, gene syntenies were performed as in a recent study exploring gene duplications (Smykal and Dolezel 2023). We compared gene synteny of genes neighboring Drosophila-type cryptochrome (d-cry) in heteropteran species Apolygus lucorum, Lygus lineolaris and Lethocerus indicus with species lacking d-cry, Cimex hemipterus and Riptortus pedestris. First, candidate d-cry-containing contigs were identified in A. lucorum, L. lineolaris and L. indicus genomes by using BLAST (Basic Local Alignment Search Tool) with L. hesperus dCRY-encoding Transcribed Sequence Assembly (TSA, acc # GDHC01010508.1) sequence as a query. Four syntenic protein-coding genes were selected in A. lucorum: Axin (Axn, GeneID: GE061_006998), yippee-like (yippee, GE061_006997), DDRGK domain-containing protein 1 (Ddrgk1, GE061_006990), and eclair (eca, GE061_006989). Next, L. indicus ‘d-cry’ contig was pre-annotated using the plugin Augustus (version 0.1.1) within Geneious Prime (2024.0.5) with Rhodnius prolixus as a reference species. Then, four other syntenic genes neighboring L. indicus d-cry were selected: venom protease-like (vp-l), WD repeat-containing protein 19 (wdr19), ubiquitin carboxyl-terminal hydrolase 7 (Usp7), and inositol polyphosphate 5-phosphatase E (inpp5e).
Protein sequences of all eight syntenic genes and L. hesperus dCRY were used as queries to search in TSA databases to identify orthologs of the syntenic genes. Lygus hesperus and Lygus lineolaris TSA databases were searched for L. lineolaris and A. lucorum, Cimex lectularius for Cimex hemipterus, Lethocerus indicus and Belostoma flumineum for L. indicus, and Riptortus pedestris for R. pedestris. Gene-specific TSA sequences were then used as BLAST queries to identify syntenic genes’ genomic contigs and to map syntenic genes to the contigs. TSAs were also mapped to the A. lucorum annotated genome to verify syntenic genes’ identity and correct/build gene models. The accession numbers of genomic contigs and representative TSAs are in Table S6.
The same approach applied to d-cry was used for jetlag (jet) gene synteny, including the same species set used for TSA and genomic search and gene mapping, respectively. Lygus hesperus jet TSA (acc # GBRD01002318) was used to identify and map jet gene-containing contigs in A. lucorum, L. lineolaris and L. indicus genomes. TSAs of three protein-coding genes upstream of jet: histone acetyltransferase kat2a (kat2a, GE061_007944), transmembrane protein 59-like (tmem59, GE061_007947), ATP-dependent (S)-NAD(P)H-hydrate dehydratase (Naxd, GE061_007948) and one jet downstream gene: Dual specificity protein phosphatase (Mkp3, GE061_007950) were selected as syntenic genes from A. lucorum genomic contig. Similarly to L. indicus d-cry, four upstream jet syntenic genes: serine/arginine repetitive matrix protein 1 (srrm1), ribosomal RNA processing protein 1 homolog (rrp1), serine/threonine-protein phosphatase 4 regulatory subunit 4 (ppp4r4) and tyrosine-protein kinase Shark (Shark), and three jet downstream genes: facilitated trehalose transporter Tret1 (Tret1), PIN2/TERF1-interacting telomerase inhibitor 1 (pinx1) and Translational activator of cytochrome c oxidase 1 (Taco1), were selected as jet syntenic genes from L. indicus. The accession numbers of jet genomic contigs and representative TSAs are in Table S7. The final gene syntenies were drawn in CorelDRAW X6 (16.4.0.1280).