The integrin-mediated adhesive complex in the ancestor of animals, fungi, and amoebae
Brown, Matthew et al. (2021), The integrin-mediated adhesive complex in the ancestor of animals, fungi, and amoebae, Dryad, Dataset, https://doi.org/10.5061/dryad.gxd2547jk
Integrins are transmembrane receptors that activate signal transduction pathways upon extracellular matrix binding. The integrin-mediated adhesive complex (IMAC) mediates various cell physiological processes. Although the IMAC was thought to be specific to animals, in the past ten years these complexes were discovered in other lineages of Obazoa, the group containing animals, fungi, and several microbial eukaryotes. Very recently, many genomes and transcriptomes from Amoebozoa (the eukaryotic supergroup sister to Obazoa), other obazoans, orphan protist lineages, and the eukaryotes’ closest prokaryotic relatives have become available. To increase the resolution of where and when IMAC proteins exist and have emerged, we surveyed these newly available genomes and transcriptomes for the presence of IMAC proteins. Our results highlight that many of these proteins appear to have evolved earlier in eukaryote evolution than previously thought and that co-option of this apparently ancient protein complex was key to the emergence of animal-type multicellularity. The role of the IMACs in amoebozoans is unknown, but they play critical adhesive roles in at least some unicellular organisms.
IDENTIFICATION OF AMOEBOZOAN INTEGRINS
Canonical IMAC proteins and their protein architecture
We selected model-organism IMAC proteins from NCBI: ITGA5:P08648, ITGB1:NP_002202, Talin:AAF27330, Parvin:AAH16713, PINCH:NP_060450.2, vinculin:AAH39174, FAK:AAA35819, Paxillin:AAC50104, ILK:NP_001014794, a-actinin:AAC17470, Filamin:AAF72339, and Tensin:AAG33700. InterProScan 5.27-66.0 was used to determine the domain architecture of these IMAC proteins, along with SignalP v 5.0 and TmHmm v 2.0. Meme-Suite 5.0.4 was used to examine integrin motifs. DeepLoc v 1.0 was used to predict the subcellular localization of integrin proteins. We set the minimum criteria for canonical IMAC proteins based on their architecture and motifs. We used OrthoMCL to assign each IMAC protein its ortholog number.
Creating a novel ortholog database
Because the OrthoMCL database is heavily biased toward metazoans, we created an ortholog database with the eukaryotes listed in the resources table. A modified version of OrthoMCL-DB was used to create a novel ortholog database from the above-listed transcriptomes and genomes together with the whole of OrthoMCL DB v5. To do this, an all-against-all Blast search was conducted with Diamond-BlastP, using each protein from the above data as a query. Diamond-BlastP collected up to 1,000 hits, with an e-value of 1e-5 as the cutoff for putative homology. Blast results were clustered using the OrthoMCL pipeline methodology.
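The hit-collection step can be illustrated with a minimal sketch. It is not the pipeline used for this dataset. It assumes the Diamond output is in the standard 12-column BLAST tabular format (e-value in column 11), that the 1,000-hit cap applies per query, and that hits arrive in best-first order. The function name filter_hits and the example lines are hypothetical.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-5  # cutoff for putative homology stated in the text
MAX_HITS = 1000       # hit cap stated in the text (assumed here to be per query)


def filter_hits(lines, evalue_cutoff=EVALUE_CUTOFF, max_hits=MAX_HITS):
    """Keep hits at or below the e-value cutoff, at most max_hits per query.

    Assumes 12-column BLAST tabular lines:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # column 11 holds the e-value
        if evalue <= evalue_cutoff and len(kept[query]) < max_hits:
            kept[query].append((subject, evalue))
    return dict(kept)


# Made-up example lines, for illustration only.
example = [
    "qA\tsB\t45.0\t200\t100\t3\t1\t200\t5\t204\t1e-30\t150",
    "qA\tsC\t25.0\t80\t55\t2\t1\t80\t10\t90\t0.01\t30",
]
hits = filter_hits(example)
```

In this toy input only the first line passes the cutoff, so hits holds a single subject for query qA.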
Obtaining Amoebozoan Integrin Orthologs using BlastP
BlastP was used to search the new ortholog database for novel amoebozoan IMAC proteins, using the previously identified amoebozoan ortholog proteins as queries. A modified custom pipeline was used to collect novel IMAC orthologs. These methods generated various orthologs with a similar protein architecture. We created a FASTA file of all novel eukaryote integrin proteins based on their protein architecture. These integrin proteins were clustered with CD-Hit at a global sequence identity threshold of 0.95. From the CD-Hit output, we used Diamond to perform an all-vs-all Blast search with a database size of 50 million sequences. A custom script implementing the Markov cluster (MCL) algorithm was then used to cluster the Diamond output. A custom Python script was used to eliminate any MCL-clustered integrin proteins shorter than 500 amino acids. We used a custom Python pipeline to collect ortholog sequences that clustered with the model metazoan IMAC orthologs listed above.
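The length filter in this step can be sketched as follows. This is an illustration under stated assumptions, not the custom script used for this dataset. It assumes sequences are supplied as plain FASTA text and that proteins of exactly 500 residues are retained. The names read_fasta and drop_short are hypothetical.

```python
MIN_LENGTH = 500  # proteins shorter than this were eliminated, per the text


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_short(records, min_length=MIN_LENGTH):
    """Return only the records whose sequence has at least min_length residues."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input: one 500-residue record split over two lines, one 499-residue record.
toy = [">long_protein", "M" * 300, "K" * 200, ">short_protein", "M" * 499]
kept = drop_short(read_fasta(toy))
```

Only long_protein survives the filter in this toy input.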
Manually assembled contigs in Sequencher
Where fragmentation made it necessary, contigs from the automated assembly described above were blasted (BlastN), for each gene of interest, back to the raw nucleotide data and to the assembly for each transcriptome. Hits were collected and assembled using Sequencher v 5.4.6 (GeneCodes, Madison, WI, USA). The taxa for which we had to take this approach were Amphizonella sp. 2, Centropyxis aerophyla, Goncevia foncevia and Hyalosphenia papilio, Pellita catalonica for ITGA, and Nebela sp. for both integrin proteins.
INTEGRIN PROTEIN ANALYSIS
PFam ID of Integrins
Even after all of these steps, integrin proteins were sometimes not captured. Therefore, the Pfam IDs of integrin domains were used to retrieve any putative integrins from the Transdecoder output. For ITGA, we used FG-GAP (PF01839), and for ITGB we used PSI_integrin (PF17205), integrin_beta (PF00362), integrin_b_cyt (PF08725), and integrin B tail (PF07965). We disregarded any protein sequence whose Pfam hit did not meet an e-value threshold of 1e-10. The identity of the putative integrins was confirmed by InterProScan, SignalP, TmHmm, DeepLoc, and Meme-Suite.
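A minimal sketch of this Pfam-based screen follows. It is not the script used for this dataset. It assumes domain hits are already parsed into (protein ID, Pfam accession, e-value) tuples, that accessions may carry a version suffix such as ".25", and that a hit passes when its e-value is at or below 1e-10. The function name candidate_integrins and the example hits are hypothetical.

```python
# Pfam accessions named in the text.
ITGA_PFAM = {"PF01839"}                                    # FG-GAP
ITGB_PFAM = {"PF17205", "PF00362", "PF08725", "PF07965"}   # ITGB domains
EVALUE_CUTOFF = 1e-10  # threshold from the text (direction is an assumption)


def candidate_integrins(hits, cutoff=EVALUE_CUTOFF):
    """Map each protein ID to the integrin subunit labels its Pfam hits support."""
    candidates = {}
    for protein_id, accession, evalue in hits:
        if evalue > cutoff:
            continue  # hit too weak to count
        family = accession.split(".")[0]  # drop a version suffix if present
        if family in ITGA_PFAM:
            candidates.setdefault(protein_id, set()).add("ITGA")
        elif family in ITGB_PFAM:
            candidates.setdefault(protein_id, set()).add("ITGB")
    return candidates


# Made-up hits, for illustration only.
example_hits = [
    ("p1", "PF01839.25", 1e-20),  # strong FG-GAP hit
    ("p2", "PF00362.20", 1e-5),   # integrin_beta hit, too weak
    ("p3", "PF08725.13", 1e-12),  # integrin_b_cyt hit
    ("p4", "PF00069.27", 1e-50),  # unrelated domain
]
candidates = candidate_integrins(example_hits)
```

Candidates returned by such a screen would still need the confirmation steps named above before being accepted as integrins.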
Examining integrin motifs with Meme-Suite
For each putative integrin protein, we examined protein motifs with Meme-Suite, and we always blasted our proteins of interest against NCBI’s GenBank NR database using BlastP to check for possible contamination. For integrin proteins with novel domains, the read depth of these transcripts was examined with Rsem (Table S3). The final amoebozoan integrin architectures were compared with metazoan integrin proteins (Figure S6). A custom Python script was used to create a presence/absence binary (1,0) matrix of IMAC proteins across Amoebozoa. A custom R script was used to create a heatmap.
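The presence/absence matrix can be sketched as below. This is an illustration, not the custom script used for this dataset. It assumes detections are available as a mapping from taxon name to the set of IMAC proteins found in that taxon. The taxon names and detections are placeholders rather than results, and presence_absence and to_tsv are hypothetical names.

```python
# Column order for the matrix; names follow the IMAC proteins listed in the text.
IMAC_PROTEINS = [
    "ITGA", "ITGB", "Talin", "Parvin", "PINCH", "Vinculin",
    "FAK", "Paxillin", "ILK", "a-actinin", "Filamin", "Tensin",
]


def presence_absence(detected, proteins=IMAC_PROTEINS):
    """Return {taxon: [1 or 0 per protein]}, with taxa in sorted order."""
    return {
        taxon: [1 if protein in found else 0 for protein in proteins]
        for taxon, found in sorted(detected.items())
    }


def to_tsv(matrix, proteins=IMAC_PROTEINS):
    """Render the matrix as tab-separated text with a header row."""
    rows = ["taxon\t" + "\t".join(proteins)]
    for taxon, values in matrix.items():
        rows.append(taxon + "\t" + "\t".join(str(v) for v in values))
    return "\n".join(rows)


# Placeholder detections, for illustration only (not results from this study).
detected = {
    "taxon_B": {"ITGB", "Talin"},
    "taxon_A": {"ITGA", "ITGB", "ILK"},
}
matrix = presence_absence(detected)
table = to_tsv(matrix)
```

A tab-separated table of this kind is one convenient input for a heatmap script, though the format used by the authors' R script is not stated.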
The data in our current study are transcriptomic, and some of the predicted transcripts may therefore be truncated or missing. Identifying full-length transcripts from short Illumina reads can be difficult because assembled transcripts can be fragmented or incomplete due to alternative splice sites and untranslated regions. It is therefore possible that some transcripts are not fully assembled and that their corresponding predicted proteins are truncated, so that the transmembrane region and signal peptide region may be missing. The same applies to the absence of these proteins from our data: because transcriptomes contain only the genes being expressed at the time mRNA was harvested, some genes within an organism’s genome may be missed. Nonetheless, our methodology permitted the unparalleled discovery of these proteins across a whole supergroup, highlighting the utility of the method.
CONFIRMATION OF AMOEBOZOAN INTEGRINS
Presence of integrin in the Mastigamoeba genome
To confirm the presence of these genes in the genome, we blasted the integrin transcripts of M. balamuthi against the whole-genome shotgun data (CBKX00000000; https://www.ncbi.nlm.nih.gov/nuccore/CBKX000000000.1) on NCBI. Common splice sites were searched for at the intron and exon boundaries of the integrin genes by assembling the integrin genomic contig and integrin transcript in Sequencher v 5.4.6. The genomic contigs that contained an integrin gene were searched for additional ORFs to examine the phylogenetic affinity of the other ORFs on those contigs. We inferred maximum likelihood phylogenetic trees of tenascin (adjacent to the ITGB gene on CBKX010020318.1) and PA14 (adjacent to the ITGB gene on CBKX010020319.1) using the same conditions as for the integrin phylogenetic trees (see below).
To examine the evolutionary relationships of ITGA and ITGB within Amorphea, protein sequences were aligned with Mafft-Linsi using the parameters “--maxiterate 1000” and “--localpair”. Ambiguous sites were trimmed from the alignments using Bmge with the gap setting at 0.8. Maximum likelihood (ML) trees were inferred from these trimmed alignments in IQtree v 1.5.5 under the LG model with the C60 series model of site heterogeneity. Each tree was ML bootstrapped (MLBS) with 1,000 pseudoreplicates. We then used a custom Python script implementing ETE-Toolkit (www.etetoolkit.org) to map protein domain architecture (IPRSCAN domain IDs) onto the resulting tree.
National Science Foundation, Award: 1456054
Division of Environmental Biology