Data from: Remote homolog detection places insect chemoreceptors in a cryptic protein superfamily spanning the tree of life
Data files
Sep 14, 2023 version files 2.22 GB
- Himmel-et-al-2023-data.zip
- README.md
Oct 23, 2023 version files 2.22 GB
- Himmel-et-al-2023-data.zip
- README.md
Abstract
Many proteins exist in the so-called “twilight zone” of sequence alignment, where low pairwise sequence identity makes it difficult to determine homology and phylogeny. As protein tertiary structure is often more conserved, recent advances in ab initio protein folding have made structure-based identification of putative homologs feasible. However, structural screening and phylogenetics are in their infancy, particularly for twilight zone proteins. We present a pipeline for the identification and characterization of distant homologs, and apply it to 7-transmembrane domain ion channels (7TMICs), a protein group founded by insect Odorant and Gustatory receptors. Previous sequence and limited structure-based searches identified putatively-related proteins, mainly in other animals and plants. However, very few 7TMICs have been identified in non-animal, non-plant taxa. Moreover, these proteins’ remarkable sequence dissimilarity made it uncertain if disparate 7TMIC types (Gr/Or, Grl, GRL, DUF3537, PHTF and GrlHz) are homologous or convergent, leaving their evolutionary history unresolved. Our pipeline identified thousands of new 7TMICs in archaea, bacteria and unicellular eukaryotes. Using graph-based analyses and protein language models to extract family-wide signatures, we demonstrate that 7TMICs have structure and sequence similarity, supporting homology. Through sequence and structure-based phylogenetics, we classify eukaryotic 7TMICs into two families (Class-A and Class-B), which are the result of a gene duplication predating the split(s) leading to Amorphea (animals, fungi and allies) and Diaphoretickes (plants and allies). Our work reveals 7TMICs as a cryptic superfamily with origins close to the evolution of cellular life. More generally, this study serves as a methodological proof of principle for the identification of extremely distant protein homologs.
README: Remote homolog detection places insect chemoreceptors in a cryptic protein superfamily spanning the tree of life
https://doi.org/10.5061/dryad.fqz612jz9
Data backing Himmel, Moi and Benton (2023).
Description of the data and file structure
Data are organized in subfolders, where each subfolder corresponds to a figure in the manuscript (e.g., subfolder "Figure 1" corresponds to Figure 1 in the manuscript). Each filename indicates the specific figure panel the data correspond to (e.g., "Figure 1A-..." refers to a file relevant to Figure 1, panel A). Below are detailed descriptions of each file, sorted by the subfolder they are contained in.
Subfolder "Figure 1":
- Figure1A-Orco-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for A. bakeri Orco.
- Figure1A-Orco-phobius-results.zip: Raw output of Phobius transmembrane prediction for A. bakeri Orco.
- Figure1B-AbakOrco-PDB-6c70-aligned-coot.pdb: Coordinates for the cryo-EM-derived 3-dimensional protein model of 6c70, oriented so that it can be viewed in alignment with the model below. This can be viewed in software such as PyMOL.
- Figure1B-DmelOrco-AF-Q9VNB5-F1-model_v4-aligned-coot.pdb: Coordinates for an AlphaFold-derived 3-dimensional protein model of Drosophila melanogaster Orco (Q9VNB5), oriented so that it can be viewed in alignment with the model above. This can be viewed in software such as PyMOL.
- Figure1C-data.csv: Comma-separated values file with percent amino acid sequence identities and Dali Z-scores for pairwise comparison of Drosophila melanogaster 7TMICs. The 4-digit identifiers for "query" and "target" correspond to the last 4 digits of the Uniprot accession for each of the proteins. "dali z-score" refers to the Dali-derived structural similarity score between the query and target. "dali % id" refers to the Dali-derived amino acid sequence identity between the query and the target.
- Figure1D-data.csv: Comma-separated values file with the results of a proof-of-concept Foldseek screen of the AlphaFold-derived D. melanogaster proteome, as in Figure 1E. "name" is the UniProt accession number of the hit. "7TMIC" indicates TRUE if it is a 7TMIC, or FALSE if it is a false positive. "pid" is the percent amino acid sequence identity to Orco. "FSevalue" is the Foldseek-derived e-value. "color" is the color of the dot in the plot in Figure 1E.
- Figure1D-Foldseek-3DI+AA-exhaustive.m8: Output of Foldseek, corresponding to the screen in the above file.
- Figure1D-query-AF-Q9VNB5-F1-model_v4.pdb: Coordinates for a 3-dimensional protein model of Q9VNB5, corresponding to D. melanogaster Orco, which was used for the Foldseek screen in the two files above.
- Figure1E-query-models.zip: Zip file containing pdb files, each encoding the coordinates for 3-dimensional protein models used as query models for the Foldseek screen described in the associated manuscript. These can be viewed in software such as PyMOL.
- Figure1G-CLANS-network.tsv: Tab-separated values file with the CLANS-derived protein sequence network described in Figure 1H. The "source" and "target" columns are nodes (protein sequences) defined in the file detailed below. "attraction" is the CLANS-derived attraction value used for clustering.
- Figure1G-CLANS-network-annotation.tsv: Tab-separated values file with the annotation for each of the nodes defined in the file above.
- Figure1G-CLANS-network-PFDlayout.cys: File for visualizing the network in Figure 1H. Can be opened in Cytoscape.
- Figure1G-representative-sequences.fasta: Fasta file containing all of the sequences used to generate the network in Figure 1H; sequences are from 70% sequence identity clustering of the results of the distant homolog screen.
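The network files described above are edge lists, so they can be inspected outside Cytoscape. The following is a minimal sketch that reads such a file into an adjacency map and counts connections per node. It assumes the file begins with a header row naming the "source", "target" and "attraction" columns; the function name and the example rows are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def load_network(handle):
    """Read a tab-separated edge list into a symmetric adjacency map.

    Assumes a header row with 'source', 'target' and 'attraction' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    neighbors = defaultdict(dict)
    for row in reader:
        weight = float(row["attraction"])
        # Store each edge in both directions so lookups work from either node.
        neighbors[row["source"]][row["target"]] = weight
        neighbors[row["target"]][row["source"]] = weight
    return neighbors


# Hypothetical rows for illustration only; not taken from the dataset.
example = io.StringIO(
    "source\ttarget\tattraction\n"
    "seqA\tseqB\t0.8\n"
    "seqA\tseqC\t0.2\n"
)
network = load_network(example)
degrees = {node: len(edges) for node, edges in network.items()}
print(degrees)
```

To use it on a real file, pass an open file handle instead of the in-memory example, for instance `open("Figure1G-CLANS-network.tsv", newline="")`.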
Subfolder "Figure 2":
- Figure2-AF-monomers.zip: Zip file containing pdb files, each encoding the coordinates for 3-dimensional protein models of representative 7TMIC types. These can be viewed in software such as PyMOL.
- Figure2A-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for Pyrodinium bahamense A0A7S0FAQ9.
- Figure2A-phobius-results.zip: Raw output of Phobius transmembrane prediction for Pyrodinium bahamense A0A7S0FAQ9.
- Figure2B-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for Tetrahymena thermophila Q23F73.
- Figure2B-phobius-results.zip: Raw output of Phobius transmembrane prediction for Tetrahymena thermophila Q23F73.
- Figure2C-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for Halieaceae bacterium A0A7V1EQR6.
- Figure2C-phobius-results.zip: Raw output of Phobius transmembrane prediction for Halieaceae bacterium A0A7V1EQR6.
- Figure2D-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for Haloarcula japonica M0LGF2.
- Figure2D-phobius-results.zip: Raw output of Phobius transmembrane prediction for Haloarcula japonica M0LGF2.
- Figure2E-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for Heimdallarchaeota A0A1Q9PEH0.
- Figure2E-phobius-results.zip: Raw output of Phobius transmembrane prediction for Heimdallarchaeota A0A1Q9PEH0.
- Figure2F-DeepTMHMM-results.zip: Raw output of DeepTMHMM transmembrane prediction for Euryarchaeota archaeon A0A2D7BD00.
- Figure2F-phobius-results.zip: Raw output of Phobius transmembrane prediction for Euryarchaeota archaeon A0A2D7BD00.
Subfolder "Figure 3":
- Figure3A-Foldseek-input-model-accessions.txt: Text file listing the accession numbers for the AlphaFold models used in the structural similarity network analysis in Figure 3A.
- Figure3A-Foldseek-TMscore-network.tsv: Tab-separated values file defining the structural similarity network in Figure 3A. "source" and "target" are Uniprot accession numbers. "tmscore" is the TMalign TM-score between the source and target structures.
- Figure3A-Foldseek-TMscore-network-PFDlayout.cys: File for visualizing the network in Figure 3A. Can be opened in Cytoscape.
- Figure3B-PSI-Blast-input-sequences.fasta: Fasta file containing all of the sequences used to generate the sequence similarity network in Figure 3B.
- Figure3B-PSI-Blast-network-iteration3.tsv: Tab-separated values file defining the sequence similarity network in Figure 3B. "source" and "target" refer to nodes in the sequence similarity network. "eval" is the PSI-BLAST e-value. "coverage" is percent query coverage. "iteration" indicates the PSI-BLAST iteration during which the connection was made. Nodes are detailed in the corresponding .cys file, below.
- Figure3B-PSI-Blast-network-iteration3-PFDlayout.cys: File for visualizing the network in Figure 3B. Can be opened in Cytoscape.
- Figure3C-A0A812K102-centered-Foldseek-alignment.afa: Query-centered sequence alignment (in the Fasta format) output from Foldseek.
- Figure3C-query-centered-Foldseek-alignments.zip: Zip file containing query-centered sequence alignments (in the Fasta format) using all of the representative protein sequences.
- Figure3D-sequence-embedding-based-conservation-scores.zip: Raw output of the sequence embedding-based conservation analysis for all of the representative proteins, as summarized in Figure 3D.
- Figure3E-PeSTo-protein-protein-interactions.zip: Raw output of PeSTo protein-protein interaction predictions for all of the representative proteins, as summarized in Figure 3E.
- Figure3F-AF-A0A812K102-F1-model_v4.pdb: Coordinates for a 3-dimensional protein model of the A0A812K102 monomer. This can be viewed in software such as PyMOL.
- Figure3G-A0A812K102-AF2-tetramer.zip: Output of AlphaFold-multimer modeling of A0A812K102 tetramers.
Subfolder "Figure 4":
- Figure4-sequences-and-accessions.fasta: Fasta file containing all of the sequences/accessions used for phylogenetic analyses.
- Figure4A-median-tree.nwk: Newick file defining a phylogenetic tree, corresponding to the median tree from the sequence-based phylogenetic analysis described by Figure 4.
- Figure4BC-structuralTree.nwk: Newick file defining a phylogenetic tree, corresponding to the structure-based phylogenetic analysis described by Figure 4. This tree is predicted from a pairwise distance matrix, which was derived from an all-to-all table of template modeling scores.
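The Newick files above can be opened in any tree viewer, but a quick check of which proteins a tree contains needs only the leaf labels. The sketch below extracts them with a regular expression. It assumes unquoted leaf labels without embedded parentheses, commas, colons or semicolons; the function name and the example tree are hypothetical, and a dedicated phylogenetics library is the better choice for anything beyond listing leaves.

```python
import re


def newick_leaf_names(newick_text):
    """Return leaf labels from a Newick string, in order of appearance.

    A leaf label directly follows '(' or ','. Internal node labels and
    support values follow ')' and are therefore skipped.
    Assumes unquoted labels (a simplification).
    """
    matches = re.findall(r"[(,]\s*([^(),:;]+)", newick_text)
    return [name.strip() for name in matches]


# Hypothetical tree for illustration only; not taken from the dataset.
example_tree = "((leafA:0.1,leafB:0.2)0.95:0.3,leafC:0.4);"
leaves = newick_leaf_names(example_tree)
print(leaves)
```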
Subfolder "Figure S1":
- FigureS1A-CLANS-network.tsv: Tab-separated values file defining the sequence similarity network in Figure S1A. The "source" and "target" columns are nodes (protein sequences) defined in the file detailed below. "attraction" is the CLANS-derived attraction value used for clustering.
- FigureS1A-CLANS-network-annotation.tsv: Tab-separated values file containing the annotation for the sequence similarity network contained in the file above.
- FigureS1A-CLANS-network-PFDlayout.cys: File for visualizing the sequence similarity network. The file can be opened in Cytoscape.
- FigureS1B-Dali-network.tsv: Tab-separated values file defining the structural similarity network in Figure S1B. "source" and "target" correspond to the first four characters in the corresponding Uniprot accession numbers. "dalizscore" is the Dali-derived structural similarity score between the source and target structures.
- FigureS1B-Dali-network-PFDlayout.cys: File for visualizing the structural similarity network. The file can be opened in Cytoscape.
- FigureS1C-data.csv: Data backing the two plots in Figure S1C. "model" corresponds to the four first characters in the corresponding Uniprot accession numbers of the proteins. "color" is the color the dots are plotted in. "DALIzscore" is the Dali-derived structural similarity score between the model and Orco. "FSevalue" is the Foldseek e-value. "FStmscore" is the Foldseek TM-align TM-score between the model and Orco.
- FigureS1C-Dali.tsv: Tab-separated values file containing the raw output of DaliLite for Orco vs all Drosophila 7TMIC comparisons. "Chain" refers to the model, where the first four characters correspond to the first four characters of the Uniprot accession number. "Z" is the Dali Z-score. "rmsd" is the root mean square deviation of rigid body superimposition (see publicly available documentation on Dali for more detail). "lali" is the number of equivalent amino acid residues between models. "nres" is the number of residues in the Chain. "%id" is the percent amino acid sequence identity between models. "PDB" and "Description" give additional details on the model.
- FigureS1C-Foldseek-3DI+AA.m8: Output of Foldseek in 3DI+AA mode for Orco vs all Drosophila 7TMIC comparisons. "query" is the query identifier. "target" is the target identifier. "fident" is the fraction of identical matches. "alnlen" is the length of the alignment. "mismatch" is the number of mismatches in the alignment. "gapopen" is the number of gap open events. "qstart" is the alignment start position of the query. "qend" is the alignment end position of the query. "tstart" is the alignment start position of the target. "tend" is the alignment end position of the target. "evalue" is the Foldseek e-value. "bits" is the bit score. "alntmscore" is the TM-score of the alignment. "qtmscore" is the TM-score normalized by query length. "ttmscore" is the TM-score normalized by target length.
- FigureS1C-Foldseek-TMalign.m8: Output of Foldseek in TMalign mode for Orco vs all Drosophila 7TMIC comparisons. "query" is the query identifier. "target" is the target identifier. "fident" is the fraction of identical matches. "alnlen" is the length of the alignment. "mismatch" is the number of mismatches in the alignment. "gapopen" is the number of gap open events. "qstart" is the alignment start position of the query. "qend" is the alignment end position of the query. "tstart" is the alignment start position of the target. "tend" is the alignment end position of the target. "evalue" is the Foldseek e-value. "bits" is the bit score. "alntmscore" is the TM-score of the alignment. "qtmscore" is the TM-score normalized by query length. "ttmscore" is the TM-score normalized by target length.
- FigureS1D-Dali-results.zip: Zip file containing the raw output of DaliLite all-to-all comparisons of screen query models and negative controls.
- FigureS1D-query-models-dali.tsv: Tab-separated values file containing all-to-all Dali Z-scores, where the unlabeled rows match the column order. The table directly matches Figure S1D, and can be mapped to the model names by this figure.
- FigureS1D-query-models-tmscore.tsv: Tab-separated values file containing all-to-all template modeling scores, where the unlabeled rows match the column order. The table directly matches Figure S1D, and can be mapped to the model names by this figure.
- FigureS1EF-initial-Foldseek-hits.csv: Comma-separated values file of all the hits from the first-step sequence-based screen shown in Figure S3B. The name corresponds to the Uniprot/AlphaFold accession number, with average scores in the two subsequent columns.
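The Foldseek .m8 files listed above can be read as plain tab-separated tables. The sketch below maps each row to the fifteen column names given in the file descriptions and sorts hits by e-value. It assumes the files are tab-separated, that the columns appear in the order listed, and that there is usually no header row (a header starting with "query" is skipped if present). The function name and the example rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Column order as given in the file descriptions (assumed to match the files).
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
    "alntmscore", "qtmscore", "ttmscore",
]


def read_m8(handle):
    """Parse tab-separated Foldseek hits into a list of dictionaries."""
    hits = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0] == "query":
            continue  # skip blank lines and an optional header row
        if len(fields) != len(M8_COLUMNS):
            raise ValueError(f"expected {len(M8_COLUMNS)} columns, got {len(fields)}")
        hit = dict(zip(M8_COLUMNS, fields))
        # Every column after the two identifiers is numeric.
        for key in M8_COLUMNS[2:]:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Hypothetical rows for illustration only; not taken from the dataset.
example = io.StringIO(
    "queryX\thitA\t0.25\t400\t300\t5\t1\t400\t3\t410\t1e-10\t250\t0.62\t0.60\t0.58\n"
    "queryX\thitB\t0.31\t380\t262\t4\t10\t390\t12\t395\t1e-25\t410\t0.71\t0.70\t0.69\n"
)
hits = sorted(read_m8(example), key=lambda hit: hit["evalue"])
best = hits[0]
print(best["target"], best["evalue"], best["alntmscore"])
```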
Subfolder "Figure S2":
- FigureS2A-full-PSI-Blast-results.tsv: Tab-separated values file containing the full 10 PSI-BLAST network iterations. "source" and "target" refer to nodes in the sequence similarity network. "e-value" is the PSI-BLAST e-value. "coverage" is percent query coverage. "psi-blast-iteration" indicates the PSI-BLAST iteration during which the connection was made.
- FigureS2A-PSI-Blast-network-iteration1.txt: A space-separated values file containing the first iteration of the PSI-BLAST networking. "source" and "target" refer to nodes in the sequence similarity network. "e" is the PSI-BLAST e-value. "cov" is percent query coverage. "it" indicates the PSI-BLAST iteration during which the connection was made. Nodes are detailed in the corresponding .cys file, below.
- FigureS2A-PSI-Blast-network-iteration1-PFD.cys: File for visualizing the first iteration PSI-BLAST network in Figure S4A. The file can be opened in Cytoscape.
- FigureS2A-PSI-Blast-network-iteration2.txt: A space-separated values file containing the second iteration of the PSI-BLAST networking. "source" and "target" refer to nodes in the sequence similarity network. "e" is the PSI-BLAST e-value. "cov" is percent query coverage. "it" indicates the PSI-BLAST iteration during which the connection was made. Nodes are detailed in the corresponding .cys file, below.
- FigureS2A-PSI-Blast-network-iteration2-PFD.cys: File for visualizing the second iteration PSI-BLAST network in Figure S4A. The file can be opened in Cytoscape.
- FigureS2B-data.csv: Comma-separated values file containing the embedding-based conservation scores ("CONS") and the percent amino acid sequence identity ("pid") for the amino acids encoding Symbiodinium natans A0A812K102.
- FigureS2D-Orco-AF2-multimers.zip: Zip file containing the output of AlphaFold-multimer modeling of Orco dimers, trimers, tetramers, and pentamers.
- FigureS2D-A0A1Q9PEH0-AF2-tetramer.zip: Zip file containing the output of AlphaFold-multimer modeling of A0A1Q9PEH0 tetramers.
- FigureS2D-A0A7S0FAQ9-AF2-tetramer.zip: Zip file containing the output of AlphaFold-multimer modeling of A0A7S0FAQ9 tetramers.
- FigureS2D-A0A7V1EQR6-AF2-tetramer.zip: Zip file containing the output of AlphaFold-multimer modeling of A0A7V1EQR6 tetramers.
- FigureS2D-M0LGF2-AF2-tetramer.zip: Zip file containing the output of AlphaFold-multimer modeling of M0LGF2 tetramers.
- FigureS2D-Q23F73-AF2-tetramer.zip: Zip file containing the output of AlphaFold-multimer modeling of Q23F73 tetramers.
- FigureS2D-additional-AF2-tetramers.zip: Zip file containing the output of AlphaFold-multimer modeling of tetramers for A0A1Q1NIN5, A0A1S3Z4L6, A0A6H5J1L1, A0A6J3L5I1, A0A524QPV7, A0A653BQM0 and A0A811TD67.
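The per-iteration PSI-BLAST network files above are space-separated rather than tab-separated, so they need slightly different handling from the .tsv files. The sketch below reads such a file and keeps only edges at or below an e-value cutoff. It assumes a header row naming the "source", "target", "e", "cov" and "it" columns and identifiers without embedded spaces; the function name, the cutoff and the example rows are hypothetical and are not taken from the dataset or the manuscript.

```python
import io


def read_psiblast_edges(handle, max_evalue=1e-3):
    """Read whitespace-separated edges and keep those with e <= max_evalue.

    The default cutoff is an arbitrary illustration, not the authors' value.
    """
    header = handle.readline().split()
    edges = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        evalue = float(row["e"])
        if evalue <= max_evalue:
            edges.append(
                (row["source"], row["target"], evalue, float(row["cov"]), int(row["it"]))
            )
    return edges


# Hypothetical rows for illustration only; not taken from the dataset.
example = io.StringIO(
    "source target e cov it\n"
    "seqA seqB 1e-30 85.0 1\n"
    "seqA seqC 0.5 40.0 1\n"
)
edges = read_psiblast_edges(example)
print(edges)
```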
Subfolder "Figure S3":
- FigureS3B-sequences.fasta: Fasta file containing all the sequences used for sequence-based phylogenetic analyses before removal of rogue taxa.
- FigureS3B-MUSCLE5-alignments.zip: Zip file containing fasta files that contain full length Muscle5-derived multiple sequence alignments used for the sequence-based phylogenetic analyses before removal of rogue taxa (see manuscript).
- FigureS3B-alignments-trimmed.zip: Zip file containing fasta files with trimmed multiple sequence alignments, corresponding to the file above.
- FigureS3B-phylogenetic-trees.zip: Zip file containing newick files that encode phylogenetic trees inferred from the alignments in the file above.
- FigureS3B-majority-consensus-tree.nwk: Newick file containing the majority consensus tree summarizing the phylogenetic trees packaged in the file above.
- FigureS3CDE-sequences.fasta: Fasta file containing all the sequences used for sequence-based phylogenetic analyses after removal of rogue taxa.
- FigureS3CDE-MUSCLE5-alignments.zip: Zip file containing fasta files that contain full length Muscle5-derived multiple sequence alignments used for the sequence-based phylogenetic analyses after removal of rogue taxa (see manuscript).
- FigureS3CDE-alignments-trimmed.zip: Zip file containing fasta files with trimmed multiple sequence alignments, corresponding to the file above.
- FigureS3CDE-phylogenetic-trees.zip: Zip file containing newick files that encode phylogenetic trees inferred from the alignments in the file above.
- FigureS3C-majority-consensus-tree.nwk: Newick file containing the majority consensus tree summarizing the phylogenetic trees packaged in the file above.
- FigureS3E-cluster-consensus-trees.zip: Zip file containing newick files that contain the consensus trees of clusters visualized in Figure S3D.
- FigureS3FG-Foldtrees.zip: Zip file containing the newick files that contain the MAD-rooted structural phylogenies generated by Foldtree.
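The Fasta files listed in this README (sequences and alignments) can be read without specialized software. The sketch below loads a Fasta file into a dictionary keyed by record identifier. It assumes the identifier is the first whitespace-delimited token after ">" and that identifiers are unique within a file; the function name and the example records are hypothetical and are not taken from the dataset.

```python
import io


def read_fasta(handle):
    """Return a dict mapping each record identifier to its full sequence."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical records for illustration only; not taken from the dataset.
example = io.StringIO(">seqA example protein\nMKT\nLLV\n>seqB\nMAA\n")
sequences = read_fasta(example)
print({name: len(seq) for name, seq in sequences.items()})
```

For alignment files, gap characters are kept as part of each sequence, so all records in one alignment should have the same length.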