Immune-associated orthologous genes across primates
Data files
Dec 24, 2025 version; files total 7.89 GB
- All_immune_prot_genes.txt (139.57 KB)
- data.zip (2.64 GB)
- Immune_Orthogroup_primate.txt (9.80 MB)
- Orthogroup_primate.txt (31.28 MB)
- Primate_codon.fasta (5.21 GB)
- Primate_newick.txt (1.97 KB)
- README.md (3.93 KB)
Abstract
The immune system acts as a bridge between pathogenic microorganisms and their hosts. Despite its significance, the evolutionary mechanisms underlying immune system complexity in primates remain poorly understood. In this study, we applied a recently developed strategy (the TOGA pipeline) to generate a robust one-to-one orthologous gene catalog of 15,057 protein-coding genes, accounting for ~78% of the annotated gene set of the human genome (hg38) and representing ~28.9 Mbp of coding sequence, across 50 primate species and the two closest outgroups (Malayan flying lemur and Chinese tree shrew). To obtain a comprehensive immune-associated gene catalog across primates, we then surveyed six public datasets (GO, KEGG, InnateDB, ImmPort, HPO, and Enard et al., 2016) and found that 4,744 of 5,635 immune-associated genes were covered by our one-to-one orthologous catalog across the 52 species.
https://doi.org/10.5061/dryad.6q573n65z
The immune system mediates interactions between pathogens and their hosts, yet the evolution of its complexity in primates remains poorly understood. Here, we used the TOGA pipeline to build a high-confidence one-to-one orthologous gene catalog of 15,057 protein-coding genes (~78% of the human hg38 gene set; ~28.9 Mb coding sequence) across 50 primate species and two close outgroups. By integrating six public immune-gene resources, we further identified 4,744 immune-associated genes represented in this orthologous catalog across 52 species.
Description of the data and file structure
Primate_codon.fasta: coding sequences predicted by TOGA software across 52 species.
Orthogroup_primate.txt: one-to-one orthologous gene catalog including 15,057 protein-coding genes.
Immune_Orthogroup_primate.txt: one-to-one orthologous gene catalog including 4,744 immune-associated genes.
All_immune_prot_genes.txt: 5,635 immune-associated genes collected from six known public datasets.
Primate_newick.txt: primate species tree.
data.zip: a compressed archive containing data required to run the scripts available in the associated GitHub repository:
https://github.com/zhangxp1993/Immune-associated-orthologous-genes-across-primates/.
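Because Primate_codon.fasta is about 5 GB, it is usually safer to stream it record by record than to load it whole. The following is a minimal sketch, not the authors' code: the function names `iter_fasta` and `is_whole_codons` are hypothetical, the demo records are invented, and no assumption is made about how sequence headers are formatted in the real file.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream, one record at a time."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_whole_codons(seq: str) -> bool:
    """Return True when the sequence length is a multiple of three."""
    return len(seq) % 3 == 0


if __name__ == "__main__":
    # Invented demo records; replace with open("Primate_codon.fasta") for real data.
    demo = io.StringIO(">example_record\nATGGCC\nTAA\n>short_record\nATGG\n")
    for name, seq in iter_fasta(demo):
        print(name, len(seq), is_whole_codons(seq))
```

Only one record is held in memory at a time, so the same loop works on the full file when given an open file handle instead of the in-memory demo.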
The folder structure for data.zip is as follows:
data.zip
├── free_immune
├── hg38_cds.fasta
├── human.longest.transcript.bed
├── mfree_ctl
├── Primate_genome.nwk
├── species_list
├── toga.isoforms.tsv
└── TOGA_results
    ├── Aotus_nancymaae
    │   ├── codonAlignments.fa
    │   └── orthologsClassification.tsv
    ├── Ateles_geoffroyi
    │   ├── codonAlignments.fa
    │   └── orthologsClassification.tsv
    ├── Callithrix_jacchus
    │   ├── codonAlignments.fa
    │   └── orthologsClassification.tsv
    ├── Cebus_albifrons
    │   ├── codonAlignments.fa
    │   └── orthologsClassification.tsv
    └── ...
The ellipsis (...) indicates additional primate species, each following the same directory and file structure.
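After extracting data.zip, it can be useful to confirm that every species directory contains both expected files. The sketch below assumes the archive was extracted so that the species directories sit under `data/TOGA_results`; that path and the function name `check_toga_results` are illustrative assumptions, not part of the dataset or the associated repository.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the folder structure described above.
EXPECTED_FILES = ("codonAlignments.fa", "orthologsClassification.tsv")


def check_toga_results(root: Path) -> Dict[str, List[str]]:
    """Map each species directory name to the list of expected files it lacks."""
    report = {}
    for species_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [
            name for name in EXPECTED_FILES
            if not (species_dir / name).is_file()
        ]
        report[species_dir.name] = missing
    return report


if __name__ == "__main__":
    root = Path("data/TOGA_results")  # assumed extraction path
    if root.is_dir():
        for species, missing in check_toga_results(root).items():
            status = "OK" if not missing else "missing: " + ", ".join(missing)
            print(species, status)
```

An empty list for a species means both files were found.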
File and directory descriptions
- free_immune/: Lists of file paths for each immune-associated orthologous gene used in downstream analyses.
- hg38_cds.fasta: Coding DNA sequences (CDS) of human genes based on the hg38 reference genome.
- human.longest.transcript.bed: BED-format annotation of the longest transcript for each human gene.
- mfree_ctl: Configuration file used to run PAML (codeml) analyses.
- Primate_genome.nwk: Newick-format phylogenetic tree of the primate species included in this study.
- species_list: List of primate species analyzed.
- toga.isoforms.tsv: Mapping table linking human genes to their corresponding transcript isoforms used by TOGA.
- TOGA_results/: Output directories from TOGA analyses, organized by species.
TOGA_results species-level directories
Each species directory (e.g., Aotus_nancymaae, Ateles_geoffroyi, Callithrix_jacchus, Cebus_albifrons) contains:
- codonAlignments.fa: Codon-based sequence alignments corrected for frameshifting insertions and deletions.
- orthologsClassification.tsv: Table describing orthology relationships with the following columns:
  - t_gene: Gene name in the reference (human)
  - t_transcript: Transcript identifier in the reference
  - q_gene: Gene identifier in the query species
  - q_transcript: Transcript identifier in the query species
  - orthology_class: Orthology relationship class (e.g., one2one, one2many, many2one, many2many, one2zero)
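The column layout above lends itself to a short filtering step, for example keeping only one2one relationships for a given species. This is a hedged sketch rather than the authors' pipeline: it assumes the file is tab-separated with a header row carrying the five column names listed above, the helper names are hypothetical, and the demo rows are invented placeholders.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO


def read_orthology_table(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-separated orthology table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def one2one_pairs(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Map reference transcript to query transcript for one2one rows only."""
    return {
        row["t_transcript"]: row["q_transcript"]
        for row in rows
        if row["orthology_class"] == "one2one"
    }


def class_counts(rows: List[Dict[str, str]]) -> Counter:
    """Count how many rows fall into each orthology class."""
    return Counter(row["orthology_class"] for row in rows)


if __name__ == "__main__":
    # Invented placeholder rows; real identifiers will differ.
    demo = io.StringIO(
        "t_gene\tt_transcript\tq_gene\tq_transcript\torthology_class\n"
        "GENE_A\tTX_A\tq_gene_1\tq_tx_1\tone2one\n"
        "GENE_B\tTX_B\tq_gene_2\tq_tx_2\tone2many\n"
    )
    rows = read_orthology_table(demo)
    print(one2one_pairs(rows))
    print(class_counts(rows))
```

To use it on real data, pass an open handle to a species file such as `TOGA_results/Aotus_nancymaae/orthologsClassification.tsv` in place of the in-memory demo.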
