Pervasive relaxed selection on spermatogenesis genes in gorillas coincident with the evolution of polygyny
Data files
Nov 06, 2023 version files 12.56 GB
-
261Mammals.consensus.tree
203.68 KB
-
Alignments.tar
5.96 GB
-
gene_correspondence.txt
589.99 KB
-
Gorilla_HyPhy_PolyPhen-2.zip
6.60 GB
-
README.md
3.60 KB
Abstract
Gorillas have a polygynous social system in which the highest-ranking male has almost exclusive access to females and sires most of the offspring in the troop. Such behavior results in a dramatic reduction of sperm competition, which is ultimately associated with numerous traits leading to a low efficacy of gorilla spermatogenesis. However, the molecular basis behind the remarkable erosion of the gorilla male reproductive system remains unknown. Here, we explored the genetic consequences of the polygynous social system in gorillas by testing for altered selection intensity across 13,310 orthologous protein-coding genes from 261 Eutherian mammals. We identified 578 genes with relaxed purifying selection in the gorilla lineage, compared with only 96 that were positively selected. Genes under relaxed purifying selection in gorillas have accumulated numerous deleterious amino acid substitutions; their expression is biased towards male germ cells, and are enriched in functions related to meiosis and sperm biology. We tested the function of gorilla relaxed genes previously not implicated in sperm biology using the Drosophila model system and identified 41 novel spermatogenesis genes required for normal fertility. Furthermore, by exploring exome/genome sequencing data of infertile men with severe spermatogenic impairment, we found that the human orthologs of the gorilla relaxed genes are enriched for loss-of-function variants in infertile men. These data provide compelling evidence that reduced sperm competition in gorillas is associated with relaxed purifying selection on genes related to male reproductive function. The accumulation of deleterious mutations in these genes likely provides the mechanistic basis behind the low efficacy of gorilla spermatogenesis and uncovers new candidate genes for human male infertility.
README: Pervasive relaxed selection on spermatogenesis genes in gorillas coincident with the evolution of polygyny
https://doi.org/10.5061/dryad.5dv41nsbc
Alignments of protein coding genes and HyPhy json output files (which are viewable by uploading individual files to HyPhy Vision http://vision.hyphy.org).
Description of the data and file structure
Pervasive relaxed selection on spermatogenesis genes in gorillas coincident with the evolution of polygyny
Generated with HyPhy v. 2.5.32 (FEL, RELAX, RELAX scan) and 2.5.36(aBSREL analysis)
PolyPhen2 WebServer Available at http://genetics.bwh.harvard.edu/pph2/
├── Alignments.tar -13491 alignments in fasta format of 261 mammals -Generated by method in manuscript
├── 261Mammals.consensus.tree -Treefile used for the 261 mammals and analysis with RELAX and aBSREL
├── gene_correspondence.txt -File with gene number of the alignments in the alignment file and the corresponding ENSEMBL ID
└── Gorilla_HyPhy_PolyPhen-2.zip
├── ABSREL
│ ├── completed_absrel_DH_no_srv_all.tar.gz -ABSREL with Double Hit and No SRV in the analysis
│ ├── completed_absrel_DH__srv_all.tar.gz -ABSREL with Double Hit and SRV in the analysis
│ ├── completed_absrel_DTH_no_srv_all.tar.gz -ABSREL with Double and Triple Hit and No SRV in the analysis
│ ├── completed_absrel_dth_srv_all.tar.gz -ABSREL with Double Triple Hit and SRV in the analysis
│ ├── completed_absrel_DH_no_srv_all.tar.gz -ABSREL with Double Hit and No SRV in the analysis
│ ├── completed_absrel_no_srv_all.tar.gz -ABSREL with no Multiple Hit and No SRV in the analysis
│ └── completed_absrel_srv_all.tar.gz -ABSREL with no Multiple Hit and SRV in the analysis
├── FEL
│ ├── JSON
│ │ └── All JSON files generated by FEL analysis by HyPhy
│ ├── SCRIPTS
│ │ └── Python Scripts used to do analysis on the FEL
│ └── Summary
│ └── Textfile summary of the FEL Results
├── PolyPhen2: PolyPhen2 Scores of mutating residues from a predicted ancestral reconstruction of ancient hominidae to gorilla.
│ ├── ALL_PP2_SCORES_NOT_SIGNIFICANT_RELAX_OR_INTENS.TXT - PP2 Scores of all gorilla-specific mutations
│ ├── ALL_PP2_SCORES_SIGNIFICANT_INTENS.TXT - PP2 Scores of all gorilla-specific mutations of genes undergoing intensified selection
│ ├── ALL_PP2_SCORES_SIGNIFICANT_RELAX.TXT - PP2 Scores of all gorilla-specific mutations of genes undergoing relaxed selection
│ ├── ANC_HOMINIDAE_PROTEIN_SEQUENCES.fasta - Predicted ancestral reconstruction of ancient hominidae sequences
│ ├── ANC_HOMINIDAE_GORILLA_POLYPHEN2_SUBMIT.txt - File used directly for PolyPhen2 webserver
│ ├── FIGURES
│ │ └── Plots generated from the PolyPhen2 data
│ ├── Polyphen2_Gorilla_Results.txt - Output file generated by PolyPhen2 webserver
│ └── Scripts
│ └── Python Scripts used to extract data from PolyPhen2 files
└── RELAX
├── RELAX
│ └── JSON files generated by HyPhy Relax of Gorilla
└── RELAX_scan
└── JSON files generated by HyPhy Relax_Scan of Gorilla
Methods
Generation of multiple sequence alignments
The alignment pipeline starts with Ensembl v99 human coding sequences, using the longest isoforms for each gene. We then search for the orthologs of these human coding sequences within the genome assemblies of 260 other mammals with a contig size greater than 30kb present in the NCBI assembly database as of July 2020. We select a minimum 30kb contig size to avoid an excessive number of truncated orthologous coding sequences. To extract the orthologous coding sequences (CDS) to the human CDS, we use best Blat reciprocal hits from the human CDS to each other mammalian genome, and back to the human genome. We use Blat matching all possible reading frames, with a minimum identity set at 30% and the “fine” option activated (Kent, 2002). We excluded genes with less than 251 best reciprocal hits out of the 261 (human+other mammals) species included in the analysis. In total, we find 13,491 human CDS genes with best reciprocal hits orthologs in at least 250 other mammals.
We aligned each orthologous gene with Macse v2 (Ranwez et al., 2018). Macse v2 is a codon-aware multiple sequence aligner that can explicitly identify frameshifts, and readjusts reading frames accordingly. This feature of Macse is particularly important because erroneous indels introduced in coding sequences during genome sequencing and assembly process can be common and can cause frameshifts that many aligners do not take into account. These sequencing and alignment errors can result in substantial misalignments of coding sequences due to incomplete codons. Note that we start from the human CDS because they are likely to be of the highest quality in terms of sequencing and annotations. We use Macse v2 with maximum accuracy settings.
The alignments generated by Macse v2 were edited by HMMcleaner with default parameters (Franco et al., 2019). HMMcleaner is designed to remove, in a species-specific fashion, “fake” substitutions that are likely genome sequencing errors. HMMcleaner also removes “false exons” that might have been introduced during the Blat search. “Fake exons” are intronic segments that by chance have some similarity with exons that are missing from an assembly due to a sequencing gap. When looking for the most similar non-human CDS using Blat, Blat can sometimes “replace” the missing exon with a similar intronic segment.
After using HMMcleaner to remove suspicious parts of the CDS alignments in each species, we only select those codons that are still complete to remain in the alignments. In alignments of coding sequences, the flanks of indels usually include a higher number of misaligned substitutions. In each separate species, we further remove from the alignments the x upstream or downstream codons if more than x/2 of these codons code for amino acids that are different from the consensus amino acid in the whole alignment, with x varying from 1 to 20. For example, if eight of the 12 amino acids just on the left side of an indel are different from the consensus in a given species, we remove the 12 corresponding codons in that species. If two of the three amino acids just on the right side of an indel are different from the consensus in a given species, we remove the three corresponding codons in that species. Visual inspection shows that the 13,491 resulting orthologous CDS alignments are of very high quality, with a very low number of visibly ambiguous or erroneous segments.
To build the species phylogenetic tree of the 261 selected mammals, we use the third (4,444) of the 13,491 orthologous CDS alignments with the lowest GC content. CDS with a high GC content have been shown to give poor tree consensus in mammals due to a higher number of homoplasies than in AT-rich CDS, caused by more frequent Biased Gene Conversion and CpG hypermutable sites (Romiguier et al., 2013). We use IQTREE2 (Nguyen et al., 2015) to generate the consensus, maximum likelihood tree from the 4,444 AT rich CDS, with a GTR substitution model with six parameters (GTR-6) for each CDS (Kalyaanamoorthy et al., 2017; Nguyen et al., 2015; Romiguier et al., 2013) The GTR-6 model was the best fit for more than 85% of the 4,444 genes in a preliminary comparison of different substitution models.
ABSREL branch-site selection tests
Methods to infer the strength and direction of selection acting on molecular sequences, such as those to identify pervasive and episodic adaptive evolution (“positive selection”), are often based on estimating the ratio of non-synonymous (dN) to synonymous (dS) substitution rates (dN/dS or ?) in an alignment of homologous genes. These methods can be applied to entire genes or regions within genes (Hughes and Nei, 1988), sites (Nielsen and Yang, 1998), branches (Messier and Stewart, 1997; Muse and Gaut, 1994; Yang, 1998), or sites along a specific branch - the latter often called “branch-site” (BSREL) models (Pond et al., 2011; Martin D. Smith et al., 2015; Yang and Nielsen, 2002). A key advantage of the BSREL models is that they allow for a group of sites to evolve under the action of positive selection, i.e., with ?>1, while the remaining sites can evolve under purifying selection (?<1) or relatively neutrally (?=1) in specific branches, and thus can detect positive selection that is episodic (limited to a subset of branches) and restricted to a subset of sites in a gene. Positive selection is inferred when a class of sites is identified with ?>1, with a LRT P-value ≤ 0.05. Subsequently, these methods have been modified (ABSREL) to allow for the number of dN/dS rate categories to be inferred from the alignment rather than imposed a priori (Martin D. Smith et al., 2015), as well as to accommodate both synonymous rate variation across sites [S] (Pond and Muse, 2005; Wisotsky et al., 2020) and multi-nucleotide mutations per codon [MH] (Lucaci et al., 2021). In our implementation of the ABSREL test, the gorilla lineage is defined as the foreground (“test”) lineage and all other lineages are assigned to the background set. We tested for selection using the base ABSREL model, a model that accounts for synonymous rate variation across sites (ABSREL[S]), a model that accounts for multi-nucleotide mutations per codon (ABSREL[MH]), and a model that accounts for both synonymous rate variation across sites and multi-nucleotide mutations per codon (ABSREL[SMH]). Positive selection is inferred when the best fitting ABSREL model, determined from by a small sample AIC test, includes a class of sites with ?>1 at a likelihood ratio test (LRT) P-value ≤ 0.05.
RELAX selection tests
Standard methods for estimating the strength and direction of selection acting on molecular sequences based on estimating the ratio of non-synonymous (dN) to synonymous (dS) substitution rates (dN/dS or ?) are often not suitable for detecting relaxed purifying selection because they lack power for this test and can mistake an increase in the intensity of positive selection for relaxation of both purifying and positive selection (Wertheim et al., 2015). For example, if most sites within a protein coding gene experience purifying selection (dN/dS < 1) while only a subset of sites experience an episode of relaxation in the intensity of purifying selection (dN/dS ≈ 1), the average dN/dS across sites will likely still be below one (Wertheim et al., 2015). A relaxation in the intensity of selection may also be associated with the release of some sites from the action of diversifying selection, shifting these sites from dN/dS > 1 to dN/dS ≤ 1. Again, the average dN/dS across sites will likely still be below one and the episode of reduced selection intensity masked.
To overcome these limitations, we used a general hypothesis testing framework for detecting episodic relaxed selection intensity that utilizes codon-based maximum-likelihood models that are an extension of the BSREL model (RELAX) (Wertheim et al., 2015). The RELAX method is a variant of the branch-site random effects likelihood (BS-REL) model (Martin D Smith et al., 2015), which allows each lineage in a phylogeny to have a class of sites that evolve near neutrality (dN/dS ≈ 1), a class of sites that evolve under diversifying selection (dN/dS > 1), and a class of sites that evolve under purifying selection (dN/dS < 1). The RELAX test is based on the differential effects that reduced selection intensity is expected to have on these site-specific dN/dS rates: When selection is relaxed, smaller dN/dS values increase toward 1, whereas dN/dS values above 1 decrease (Wertheim et al., 2015). In the context of BSREL models, this trend can have two different effects: The dN/dS values inferred for the selection categories can move toward 1 and/or the proportions of sites belonging to the different classes can change in such a way that more sites are assigned to categories with dN/dS values closer to 1. To test whether selection is relaxed (or intensified) on a subset of a priori defined branches relative to the other branches, RELAX compares a null model in which a selection intensity parameter (K) is set to 1 for all branches (equivalent to the standard BSREL mode with three rate classes) to an alternative model in which the test branches share a K value inferred from the data. A LRT using the standard χ2 asymptotic distribution with 1 degree of freedom is used to determine significance: If the null model is rejected by the LRT, it indicates that selection on the test branches is intensified (K > 1) or relaxed (K < 1) compared with the reference branches (Wertheim et al., 2015). In our implementation of the RELAX test, the gorilla lineage is defined as the foreground (“test”) lineage and all other lineages are assigned to the background (“reference”) set.