Insights from the timber rattlesnake (Crotalus horridus) genome for MHC gene architecture and evolution in threatened rattlesnakes
Data files
Jan 02, 2025 version (1.19 MB total)
- CrAd_MHC_annotations.gff (23.70 KB)
- CrHo_MHC_annotations.gff (14.40 KB)
- CrOr_MHC_annotations.gff (17.60 KB)
- MHCI_4species_aligned.phy (250.04 KB)
- MHCIIB_4species_aligned.phy (151.87 KB)
- README.md (10.27 KB)
- Reptile_MHC_Database.fasta (707.51 KB)
- SiCa_MHC_annotations.gff (10.52 KB)
Abstract
Conservation of threatened species can benefit from an evaluation of genes in the Major Histocompatibility Complex (MHC), whose loci encode proteins that bind pathogens and are often under strong selection to maintain diversity in immune response to diseases. Despite this gene family’s importance to disease resistance, little is known about these genes in reptiles, including snakes. To address this issue, we assembled and annotated a highly contiguous genome assembly for the timber rattlesnake (Crotalus horridus), a pit viper that is threatened or endangered in parts of its range, and analyzed this new genome along with three other rattlesnake genomes to characterize snake MHC loci. We identified highly duplicated MHC class I and class IIβ genes in all species, typified by a genomic architecture of discrete gene clusters localized on chromosome 2. The number of loci varied between species from 14 to 23 for MHC I and from 8 to 32 for MHC IIβ and was greater than previously identified in the few non-genome-based studies of reptile MHC to date. We present evidence of the gene family’s complex evolutionary history, with extensive duplication and loss concurrent with speciation resulting in incomplete lineage sorting. The differences in gene number between species combined with a dynamic evolutionary history suggest that gene family expansion/contraction via rapid duplication/gene loss may represent an important mechanism for generating genetic diversity in rattlesnake MHC. Our work demonstrates the utility of whole-genome sequences for identifying functional genetic variation in the form of MHC genes relevant for conservation genomic studies in threatened snakes.
README: Insights from the timber rattlesnake (Crotalus horridus) genome for MHC gene architecture and evolution in threatened rattlesnakes
https://doi.org/10.5061/dryad.z612jm6nd
Description of the data and file structure
This dataset contains files produced and used for analyses in Roseman et al. to characterize major histocompatibility complex (MHC) genes in several rattlesnake species. Intermediate files can be generated from the code available at https://github.com/marissaroseman/Rattlesnake_MHC using genome data published on NCBI (see Supplemental Table S3 for GenBank accession numbers), so only the major files needed to replicate the analysis or conduct similar studies are included here. These include:
A database of MHC coding sequences annotated in other reptile species (Reptile_MHC_Database.fasta)
Sequence alignments of MHC I and IIB genes for the four studied rattlesnake species (MHCI_4species_aligned.phy and MHCIIB_4species_aligned.phy)
Annotations for putative MHC coding sequences for the four studied rattlesnake species (CrAd_MHC_annotations.gff, CrHo_MHC_annotations.gff, CrOr_MHC_annotations.gff, SiCa_MHC_annotations.gff)
Files and variables
File Name: Reptile_MHC_Database.fasta
File Format: FASTA
This file follows standard fasta format to provide sequences annotated as MHC genes in other reptile species. Each sequence is represented by two lines. The first line includes the species name, the gene number from the annotation, and the class of MHC gene (MHCI or MHCIIB) in the format >Species.number_class. The second line is the nucleotide sequence.
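As an illustration of the header layout described above, here is a minimal Python sketch that reads a two-line-per-record fasta and splits each header into species, gene number, and MHC class. It is not part of the published code. The example header and sequence are made up, and splitting from the right on "_" and "." is an assumption about how species names are written in the file.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class MhcRecord(NamedTuple):
    species: str
    gene_number: str
    mhc_class: str
    sequence: str


def parse_header(header: str) -> Tuple[str, str, str]:
    """Split '>Species.number_class' into (species, number, class)."""
    body = header.lstrip(">").strip()
    # Assumption: split from the right so that any underscore or dot
    # inside the species name itself is left untouched.
    rest, mhc_class = body.rsplit("_", 1)
    species, gene_number = rest.rsplit(".", 1)
    return species, gene_number, mhc_class


def read_two_line_fasta(lines: Iterable[str]) -> Iterator[MhcRecord]:
    """Yield one record per header line followed by one sequence line."""
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
        elif header is not None:
            yield MhcRecord(*parse_header(header), sequence=line)
            header = None


# Hypothetical record, for illustration only.
example = [">Python_bivittatus.12_MHCI", "ATGGCCTAA"]
records = list(read_two_line_fasta(example))
```

To read the real file, the same function can be given an open file handle, for example `read_two_line_fasta(open("Reptile_MHC_Database.fasta"))`.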
Species included and corresponding GenBank accessions are:
Python bivittatus (Burmese python) GCA_000186305.2
Zootoca vivipara (common lizard) GCA_011800845.1
Pantherophis guttatus (corn snake) GCA_001185365.2
Thamnophis elegans (western terrestrial garter snake) GCA_009769535.1
Anolis carolinensis (green anole) GCA_000090745.2
Naja naja (Indian cobra) GCA_009733165.1
Ophiophagus hannah (king cobra) GCA_000516915.1
Varanus komodoensis (Komodo dragon) GCA_004798865.1
Protobothrops mucrosquamatus (brown-spotted pit viper) GCA_001527695.3
Podarcis muralis (common wall lizard) GCA_004329235.1
Lacerta agilis (sand lizard) GCA_009819535.1
Crotalus tigris (tiger rattlesnake) GCA_016545835.1
Code used for generating this dataset can be found at https://github.com/marissaroseman/Rattlesnake_MHC.
Briefly, we downloaded the reference genomes and annotations for each species from NCBI using the 0_curl_reptile_genomes.sh script, then used grep and awk commands in 1_process_gff.sh to extract annotations with MHC-related keywords from each annotation gff. We then ran a custom script to extract the sequences for the genes of interest from the reference genome using the selected annotations, and saved the result as a fasta file. This step is run by the script 2_extract_gff.out, which calls Extract_gff_feature_v0.2.py. The fasta outputs for all included reptile species were then concatenated into Reptile_MHC_Database.fasta.
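The keyword-filtering step can be pictured with the short Python sketch below. It is only an illustration of the idea and does not reproduce 1_process_gff.sh, which uses grep and awk. The keyword list, the function name, and the example annotation lines are all placeholders, and the actual search terms should be taken from the repository scripts.

```python
from typing import Iterable, List, Optional, Sequence

# Placeholder keywords; the real search terms are defined in 1_process_gff.sh.
EXAMPLE_KEYWORDS = ("histocompatibility", "mhc")


def filter_gff_by_keywords(
    lines: Iterable[str],
    keywords: Sequence[str] = EXAMPLE_KEYWORDS,
    feature_type: Optional[str] = None,
) -> List[str]:
    """Keep GFF feature lines whose attribute column mentions any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip comments, directives, and blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        # Optionally restrict to one feature type (column 3), e.g. "CDS".
        if feature_type is not None and fields[2] != feature_type:
            continue
        if any(keyword in fields[8].lower() for keyword in lowered):
            kept.append(line)
    return kept


# Hypothetical annotation lines, for illustration only.
example_lines = [
    "##gff-version 3",
    "chr1\tGnomon\tCDS\t100\t400\t.\t+\t0\tID=cds1;product=major histocompatibility complex class I-related protein",
    "chr1\tGnomon\tCDS\t900\t1200\t.\t-\t0\tID=cds2;product=keratin",
]
mhc_lines = filter_gff_by_keywords(example_lines, feature_type="CDS")
```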
File names: MHCI_4species_aligned.phy and MHCIIB_4species_aligned.phy
File format: PHYLIP (.phy)
These files follow standard PHYLIP format to represent a multiple sequence alignment of putative MHC coding sequences across the four studied rattlesnake species. On the first line, the first number is the number of sequences, and the second number is the number of characters (A/T/C/G bases and dashes for gaps) in each aligned sequence. All subsequent lines are the aligned sequences, one per line, preceded by the sequence identifier. The sequence identifier consists of the first two letters of the genus and species name, the type of MHC gene (MHC I or MHC IIB), and a sequence number that counts from one up to the number of genes identified in each species. The species represented are Crotalus horridus (CrHo), Crotalus adamanteus (CrAd), Crotalus oreganus (CrOr) and Sistrurus catenatus (SiCa). MHC I and MHC IIB genes are aligned separately in their corresponding files.
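A minimal Python sketch for reading an alignment laid out as described above is shown here. It assumes that each identifier is separated from its sequence by whitespace (relaxed PHYLIP) rather than padded to a fixed width, and the example identifiers and sequences are invented for illustration. The species code is taken as the first four characters of the identifier, following the naming scheme above.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_phylip(lines: Iterable[str]) -> Dict[str, str]:
    """Parse a one-sequence-per-line PHYLIP alignment into {name: sequence}."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # Header line: number of sequences, then alignment length.
    n_seqs, n_chars = (int(value) for value in rows[0].split())
    alignment: Dict[str, str] = {}
    for row in rows[1:]:
        # Assumption: whitespace separates the identifier from the sequence.
        name, sequence = row.split(None, 1)
        alignment[name] = sequence.replace(" ", "")
    if len(alignment) != n_seqs:
        raise ValueError(f"expected {n_seqs} sequences, found {len(alignment)}")
    for name, sequence in alignment.items():
        if len(sequence) != n_chars:
            raise ValueError(f"{name} has length {len(sequence)}, expected {n_chars}")
    return alignment


def count_genes_per_species(alignment: Dict[str, str]) -> Counter:
    """Count sequences per four-letter species code (e.g. CrHo, SiCa)."""
    return Counter(name[:4] for name in alignment)


# Hypothetical alignment, for illustration only.
example = [
    "3 9",
    "CrHo_MHCI_1 ATGGCC---",
    "CrHo_MHCI_2 ATGGCCTAA",
    "SiCa_MHCI_1 ATG---TAA",
]
alignment = parse_phylip(example)
counts = count_genes_per_species(alignment)
```

The header check is useful as a quick sanity test after downloading, since a truncated file will fail either the sequence count or the length comparison.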
Sequences included are the final retained coding sequences produced by multiple steps of gene identification and validation, as described in Roseman et al. Scripts associated with this process are available at https://github.com/marissaroseman/Rattlesnake_MHC, though much of the annotation pipeline involves manual curation. Briefly, for each studied rattlesnake species, we used the 3_run_mhc_annotation.sh script to run ToxCodAn-Genome (Nachtigall et al. 2024) to identify putative MHC sequences in reference genomes using the Reptile_MHC_Database.fasta file. We manually assessed the resulting annotations to filter out those with low-confidence matches to MHC genes in the NCBI BLAST database or those with internal or no stop codons. We used FGENESH+ from http://www.softberry.com to survey genomic regions identified as likely to contain MHC sequences by ToxCodAn-Genome. FGENESH+ requires a protein sequence as a search input, so we used 4_get_proteins.sh to generate protein sequences in fasta format corresponding to the reptile coding sequence that each rattlesnake target genomic region matched. It also requires the nucleotide sequences to query, which we extracted in fasta format for the reptile genomic regions of interest with the 5_get_matched_regions.sh script. We manually assessed the resulting annotations from FGENESH+ as described above for the raw ToxCodAn-Genome annotations. We used the 6_fgenesh2gff.sh script to convert the FGENESH+ webpage text to a formatted gff and to correct the genomic coordinates so that they refer to the genomic scaffolds rather than the nucleotide query sequence. For studied rattlesnake species with an available whole-genome annotation file, we extracted coding sequences previously annotated as MHC with 1_process_gff.sh.
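Two of the steps above are simple enough to sketch in Python: the check for internal or missing stop codons, and the shift from query-relative to scaffold-relative coordinates. Neither function is taken from the repository, and the published filtering was done manually. The sketch assumes the standard stop codons (TAA, TAG, TGA), a coding sequence whose length is a multiple of three, 1-based inclusive coordinates, and a query region extracted from the forward strand of the scaffold.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_stop_codon_filter(cds: str) -> bool:
    """Return True if the CDS ends in a stop codon and has no internal stop."""
    seq = cds.upper()
    # Assumption: a complete CDS is non-empty and a multiple of three long.
    if len(seq) == 0 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    has_terminal_stop = codons[-1] in STOP_CODONS
    has_internal_stop = any(codon in STOP_CODONS for codon in codons[:-1])
    return has_terminal_stop and not has_internal_stop


def to_scaffold_coords(
    local_start: int, local_end: int, region_start: int
) -> Tuple[int, int]:
    """Shift 1-based coordinates within a query region onto the scaffold.

    region_start is the 1-based scaffold position of the first base of the
    extracted query region (forward strand assumed).
    """
    offset = region_start - 1
    return local_start + offset, local_end + offset


# Hypothetical values: a feature at positions 1-300 of a query region that
# begins at scaffold position 1000.
scaffold_start, scaffold_end = to_scaffold_coords(1, 300, 1000)
```

With these example numbers the feature spans scaffold positions 1000 to 1299, which keeps its length of 300 bases.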
All sources of putative genes (ToxCodAn-Genome, FGENESH+, and whole-genome annotation files) were imported into Geneious Prime as gffs with the reference sequence for manual curation, which included confirming the presence of a signal peptide, aligning to homologous sequences, and checking other characteristics expected of MHC genes. Retained and finalized sequences for each gene type were translation-aligned in Geneious with MUSCLE to produce the MHCI_4species_aligned.phy and MHCIIB_4species_aligned.phy files here.
File Names: CrAd_MHC_annotations.gff, CrHo_MHC_annotations.gff, CrOr_MHC_annotations.gff, SiCa_MHC_annotations.gff
File Format: General Feature Format 3 (GFF3)
These files follow standard GFF format for annotation features with the following tab-separated columns:
- Scaffold/Chromosome name
- Program used to generate the feature (ToxCodAn for annotations generated with ToxCodAn-Genome, fgenesh for annotations requiring searching a genomic region with FGENESH+, and maker for annotations taken from the whole-genome annotation gff)
- Feature type (In this case, all are CDS for coding sequence)
- Start position of the feature (relative to start of the scaffold/chromosome)
- End position of the feature (relative to start of the scaffold/chromosome)
- Score (sometimes used to represent sequence similarity, but often left as “.” to represent “undefined”)
- Strand (+ or -, representing orientation of the sequence relative to the start of the scaffold/chromosome)
- Phase relative to the reading frame (values of 0, 1, or 2 representing which position in a codon the feature begins at, or “.” to denote “undefined”)
- Attribute(s) providing information about the feature (formatted as tag=value, e.g. “ID=MHCI”. Multiple attributes must be separated by semicolons)
Each studied rattlesnake species has its own file containing all our MHC CDS annotations for the species. As in the 4species_aligned.phy files, gene names are given as the first two letters of the genus and species name, the type of MHC gene (MHC I or MHC IIB) and a sequence number.
These GFF files were produced from the final annotations curated in Geneious Prime, as described in Roseman et al. and above. For each species, the CDS annotations were then exported in GFF3 format.
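The nine-column layout described above can be read with a short Python sketch like the following. The field names, the example line, and its gene identifier are illustrative only, and real attribute tags should be checked against the files. GFF coordinates are 1-based and inclusive, so feature length is end minus start plus one.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]

    @property
    def length(self) -> int:
        # GFF coordinates are 1-based and inclusive on both ends.
        return self.end - self.start + 1


def parse_gff_line(line: str) -> GffFeature:
    """Parse one tab-separated GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, feature_type, start, end, score, strand, phase, raw = fields
    attributes: Dict[str, str] = {}
    # Attributes are tag=value pairs separated by semicolons.
    for pair in raw.split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attributes[tag.strip()] = value.strip()
    return GffFeature(
        seqid, source, feature_type, int(start), int(end),
        score, strand, phase, attributes,
    )


# Hypothetical feature line, for illustration only.
example = "scaffold_2\tToxCodAn\tCDS\t1000\t1299\t.\t+\t0\tID=CrHo_MHCI_1"
feature = parse_gff_line(example)
```

Grouping parsed features by their source column (ToxCodAn, fgenesh, or maker) gives a quick count of how many CDS annotations came from each route in a given species.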
Code/software
Code associated with generating the MHC annotations is available at https://github.com/marissaroseman/Rattlesnake_MHC.
ToxCodAn-Genome (Nachtigall et al. 2024) is available at https://github.com/pedronachtigall/ToxCodAn-Genome.
FGENESH+ is available to run online (with limited daily searches) at http://www.softberry.com/berry.phtml?topic=fgenes_plus&group=programs&subgroup=gfs
Geneious Prime requires a paid license subscription, but free trials are available at geneious.com.
Access information
Reference genomes and annotations used are publicly available from GenBank, as outlined below:
Species | GenBank accession |
---|---|
Crotalus adamanteus | GCA_039797435.1 |
Sistrurus catenatus | GCA_037127405.1 |
Crotalus oreganus | GCA_024509115.1 |
Anolis carolinensis | GCA_000090745.2 |
Python bivittatus | GCA_000186305.2 |
Ophiophagus hannah | GCA_000516915.1 |
Crotalus tigris | GCA_016545835.1 |
Naja naja | GCA_009733165.1 |
Protobothrops mucrosquamatus | GCA_001527695.3 |
Pantherophis guttatus | GCA_001185365.2 |
Thamnophis elegans | GCA_009769535.1 |
Zootoca vivipara | GCA_011800845.1 |
Podarcis muralis | GCA_004329235.1 |
Lacerta agilis | GCA_009819535.1 |
Varanus komodoensis | GCA_004798865.1 |