Data from: Convergent rates of protein evolution identify novel targets of sexual selection in primates
Data files
Oct 19, 2023 version files 50.36 MB
-
23species80percent.txt
12.27 MB
-
BodyMassData.csv
2.02 KB
-
BodySizeDimorphsimMetascape.zip
14.66 MB
-
BodySizeTissueEnrichment.csv
8.89 KB
-
PAMLresultsBodySize.csv
20.95 KB
-
PAMLresultsTestesSize.csv
21.08 KB
-
PAMLtrees.txt
595 B
-
README.md
6.45 KB
-
RelativeTestesSizeConservedTestesMetascape.zip
14.66 MB
-
RelativeTestesSizeMetascape.zip
6.49 MB
-
RERResultsBodySizeDimorphism.csv
1.10 MB
-
RERResultsTestesSize.csv
1.10 MB
-
TestesMassData.csv
2.03 KB
-
TestesSizeTissueEnrichment.csv
11.69 KB
Abstract
Sexual selection is the differential reproductive success of individuals, resulting from competition for mates, mate choice, or success in fertilization. In primates, this selective pressure often leads to the development of exaggerated traits which play a role in sexual competition and successful reproduction. In order to gain insight into the mechanisms driving the development of sexually selected traits, we used an unbiased genome-wide approach across 21 primate species to correlate individual rates of protein evolution to relative testes size and sexual dimorphism in body size, two anatomical hallmarks of sexual selection in mammals. Among species with presumed high levels of sperm competition, we detected strong conservation of testes-specific proteins responsible for spermatogenesis and ciliary form and function. In contrast, we identified accelerated evolution of female reproductive proteins expressed in the vagina, cervix, and fallopian tubes in these same species. Additionally, we found accelerated protein evolution in lymphoid tissue, indicating that adaptive immune functions may also be influenced by sexual selection. This study demonstrates the distinct complexity of sexual selection in primates revealing contrasting patterns of protein evolution between male and female reproductive tissues.
https://doi.org/10.5061/dryad.fn2z34v23
Description of the data and file structure
The following files provide phenotypic data used as input in the analyses:
BodyMassData.csv
Data on male and female body mass (kg) used in this study. “N” refers to the numbers of individuals in the cited studies. “na” indicates that numbers of individuals were not provided in the cited studies.
TestesMassData.csv
Data on testes and body mass used in this study. Expected testes weight was calculated using the function Y = 0.034X0.68 as described in Kenagy and Tombulak (1986, J Mammal 67:1-22). Where X = body weight and Y = expected testes weight. Here relative testes size is equivalent to the observed testes size divided by the expected.
The following files provide results of our analyses:
23species80percent.txt
Phylogenetic branch length estimates for each protein, as estimated from the aaml program of the PAML package (Yang 2007, Mol Biol Evol 24:1586-1591).
RERResultsBodySizeDimorphism.csv
Correlation of the relative evolutionary rate (RER) for each protein. Rho is the correlation coefficient between RER and the phenotypic vector, in this case the degree of sexual dimorphism in body mass. N is the number of species included for that protein. P is the nominal P-value; p.adj is the Benjamini-Hochberg corrected p-value; permpval is the p-value estimated with the phylogenetically restricted permutations, or permulations, as described in Saputra et al. (2021, Mol Biol Evol 38:3004-21); permpvaladj is the permulation p-value corrected with the Benjamini-Hochberg procedure.
RERResultsTestesSize.csv
Correlation of the relative evolutionary rate (RER) for each protein. Rho is the correlation coefficient between RER and the phenotypic vector, in this case the mass of the testes relative to body mass. N is the number of species included for that protein. P is the nominal P-value; p.adj is the Benjamini-Hochberg corrected p-value; permpval is the p-value estimated with the phylogenetically restricted permutations, or permulations, as described in Saputra et al. (2021, Mol Biol Evol 38:3004-21); permpvaladj is the permulation p-value corrected with the Benjamini-Hochberg procedure.
TestisSizeTissueEnrichment.csv
Results of the tissue enrichment analyses with proteins whose evolution is significantly accelerated or conserved in the RER involving relative testes size (see Methods). Description of columns: HPA.stat = enrichment statistic; HPA.pval = nominal p-value; HPA.p.adj = p-value corrected with the Benjamini-Hochberg procedure; HPA.num.genes = number of genes (proteins) whose expression is enriched the tissue; HPA.permpval = p-value estimated with the phylogenetically restricted permutations, or permulations, as described in Saputra et al. (2021, Mol Biol Evol 38:3004-21); HPA.permpvalueadj = permulation p-value corrected with the Benjamini-Hochberg procedure.
BodySizeTissueEnrichment.csv
Results of the tissue enrichment analyses with proteins whose evolution is significantly accelerated or conserved in the RER involving body size sexual dimorphism (see Methods). Description of columns: HPA.stat = enrichment statistic; HPA.pval = nominal p-value; HPA.p.adj = p-value corrected with the Benjamini-Hochberg procedure; HPA.num.genes = number of genes (proteins) whose expression is enriched the tissue; HPA.permpval = p-value estimated with the phylogenetically restricted permutations, or permulations, as described in Saputra et al. (2021, Mol Biol Evol 38:3004-21); HPA.permpvalueadj = permulation p-value corrected with the Benjamini-Hochberg procedure.
BodySizeDirmorphsimMetascape.zip
Zip file containing the complete output from express analysis in Metascape (Zhou et al. Nature Commun. 2019, 10:1523) comparing accelerated and conserved proteins in species with large body size dimorphism. Analyses run include gene list enrichment analysis and protein-protein interaction network analyses.
RelativeTestesSizeConservedTestesMetascape.zip
Zip file containing the complete output from express analysis in Metascape (Zhou et al. Nature Commun. 2019, 10:1523) comparing conserved testes-specific proteins for each trait. Analyses run include gene list enrichment analysis and protein-protein interaction network analyses.
RelativeTestesSizeMetascape.zip
Zip file containing the complete output from express analysis in Metascape (Zhou et al. Nature Commun. 2019, 10:1523) comparing accelerated and conserved proteins in species with large relative testes. Analyses run include gene list enrichment analysis and protein-protein interaction network analyses.
PAMLtrees.txt
Phylogenetic tree in Newick format designating foreground and background branches for use in PAML analyses branch models.
PAMLresultsTestesSize.csv
Likelihood estimates for three models in PAML (M0 = uniform dN/dS; two-branch = one dN/dS for foreground; one dN/dS for background; and two-branch neutral = one dN/dS for background; foreground dN/dS = 1), for the top 200 proteins (ranked by p-value) in the RER correlation analysis involving relative testes size.
PAMLresultsBodySize.csv
Likelihood estimates for three models in PAML (M0 = uniform dN/dS; two-branch = one dN/dS for foreground; one dN/dS for background; and two-branch neutral = one dN/dS for background; foreground dN/dS = 1), for the top 200 proteins (ranked by p-value) in the RER correlation analysis involving body size dimorphism.
Sharing/Access information
Sequence files and alignments were taken from the UCSC Genome Browser’s Mammal Multiz Alignment & Conservation Track: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz30way/
Code/Software
Three python scripts were used for data manipulation, formatting, and automating. They are:
CountLengthsofSpliceVariants.txt
python script to identify longest splice variant for each gene and rename genes for analysis in aaml.
Codeml_Stopcodons.txt
python script to remove stop codons in alignments of coding sequence prior to analyses using the codeml program of the PAML package (Yang 2007, Mol Biol Evol 24:1586-1591).
aaml.txt
python script to run the aaml program of the PAML package (Yang 2007, Mol Biol Evol 24:1586-1591) and consolidate results for analysis in RER converge.
Analyses were conducted using the Mammals Multiz Alignment & Conservation (27 primates) amino acid alignments from the UCSC genome browser. This dataset includes 27 primate species and 3 outgroups, containing 19,975 amino acid sequence alignments of known human genes aligned against the translated sequence from the genome assemblies of other species. For each alignment branch lengths were estimated using the aaml program in the Phylogenetic Analysis by Maximum Likelihood (PAML) package version 4.9 (Yang 2007, Mol Biol Evol 24:1586-91), using an empirical model of amino acid substitution rates with rate variability between sites modeled as a gamma distribution approximated with four discrete classes and an additional class for invariable sites (aaml model ‘Empirical + F’).
For each primate species, relative testes size was calculated with consideration to the allometric relationship between testes mass and body mass. This was calculated as the logarithm of the ratio of observed testes size compared to the expected testes size generated using the function Y = 0.034X0.68 where Y is the expected mass of both testes in grams and X is the observed body mass in grams (Kenagy and Trombulak 1986, J Mammal 67:1-22). Values of body mass for males and females of each species were taken from Smith and Jungers (1997, J Hum Evol 32:523-9), except for those of the drill because of their small sample size; instead, drill body mass was taken from Setchell (2017, Intl Enc Primatol). Body size dimorphism was calculated as the log-transformed ratio of male body weight over female body weight. Ancestral states were estimated through phylogenetic modeling based on extant species values using the phytools R package (Revell 2012, Methods Ecol Evol 3:217-23).
To perform our primary analysis, we used the RERconverge pipeline (Kowalczyk et al. 2019, Bioinformatics 35:4815-17) which quantifies the correlation between the rate of change of a trait and the rate of protein evolution allowing for the identification of proteins which are evolving in response to selection related to a particular phenotype along independent lineages. Using linear regression, we identified the degree of divergence for each protein from its expected rate of evolution to calculate the relative evolutionary rate (RER). Phenotype vectors and calculated RERs for all branches were correlated using Pearson linear correlations and were adjusted using the Benjamini-Hochberg procedure as well as 1000 phylogenetically restricted permutations, or permulations, as described in Saputra et al. (2021, Mol Biol Evol 38:3004-21).
Expression summaries were collected from the Human Protein Atlas (proteinatlas.org) for each nominally significant protein for both traits (uncorrected p < 0.05) including tissue expression (Uhlén et al. 2015, Science 347:1260419) and single-cell type expression data (Karlsson et al. 2021, Sci Adv 7:eabh2169). Tissue enrichment analyses were performed for 38 tissues where a Wilcoxon Rank-Sum statistic was estimated using RERConverge commands to compare the rho value multiplied by the negative log of the p-value for correlations of genes within the tissue against the same value for all genes. Metascape (https://metascape.org) with default parameters was used for Gene Ontology enrichment analysis of conserved and accelerated proteins with uncorrected p-values and permulation p-values less than 0.05 (Zhou et al. 2019, Nat Commun 10:1523).
To differentiate between positive selection and relaxed constraint for the top 200 proteins identified in the RER correlation for each trait, ranked by p-value, we used codeml from the PAML package (Yang 2007) to fit the data (alignment of coding regions from the Mammals Multiz Alignment & Conservation nucleotide alignments) to a model constrained to a single ratio of the nonsynonymous substitution rate to the synonymous substitution rate (dN/dS) across all branches of the phylogeny (uniform ratio model, M0). The likelihood of two other models were estimated, a branch model where two sets of branches (foreground and background) were each allowed to have their own dN/dS, and another branch model where the foreground was constrained to neutrality (dN/dS = 1). For each trait, in these models the terminal branches leading to any extant species in the top one-third of the trait value were specified as foreground, and the remaining branches as background. To determine if the foreground and background branches are evolving differently as fit to these models, a likelihood ratio test (LRT) compared the likelihoods of the two-branch model to the uniform model with significance assessed using a chi-square test with one degree of freedom. To test for positive selection across the entire gene, an LRT compared the two-branch model to the neutral foreground model; the was considered as evidence for positive selection if the result was statistically significant and the foreground dN/dS was greater than one.