
Fitness effects of mutations: An assessment of PROVEAN predictions using mutation accumulation data

Cite this dataset

Sandell, Linnea; Sharp, Nathaniel (2022). Fitness effects of mutations: An assessment of PROVEAN predictions using mutation accumulation data [Dataset]. Dryad.


Predicting fitness in natural populations is a major challenge in biology. It may be possible to leverage fast-accumulating genomic datasets to infer the fitness effects of mutant alleles, allowing evolutionary questions to be addressed in any organism. In this paper, we investigate the utility of one such tool, called PROVEAN. This program compares a query sequence with existing data to provide an alignment-based score for any protein variant, with scores categorized as neutral or deleterious based on a preset threshold. PROVEAN has been used widely in evolutionary studies, e.g., to estimate mutation load in natural populations, but has not been formally tested as a predictor of aggregate mutational effects on fitness. Using three large, published datasets on the genome sequences of laboratory mutation accumulation lines, we assessed how well PROVEAN predicted the actual fitness patterns observed, relative to other metrics. In most cases, we find that a simple count of the total number of mutant proteins is a better predictor of fitness than the number of variants scored as deleterious by PROVEAN. We also find that the sum of all mutant protein scores explains variation in fitness better than the number of mutant proteins in one of the datasets. We discuss the implications of these results for studies of populations in the wild.


We used previously published datasets of growth rates and mutations in mutation accumulation (MA) lines of Saccharomyces cerevisiae and Chlamydomonas reinhardtii. For each dataset, we computed the mutated protein sequences and scored each protein variant, relative to the laboratory ancestor, with PROVEAN.

We ran PROVEAN on the ComputeCanada cluster. As the program failed to run with the recent BLAST software (version 2.9.0), we configured PROVEAN to run with PSI-BLAST and BLASTDBCMD (Altschul et al. 1997) from BLAST version 2.4.0. We used version 4.8.1 of CD-HIT. We ran our variants with the NCBI nr database from 12/11/2019, which holds 142 GB of non-redundant sequences (229,636,095 sequences). We also ran a subset of variants using the 2012 database on which PROVEAN was developed (the first 5 GB); this did not radically change the PROVEAN scores of the variants. The supporting sequence sets used to compute the alignment scores for all proteins were saved.

Sc1 We used the mutations reported in Sharp et al. (2018; Dataset_S2.xlsx). There were 1474 genic mutations in the dataset, occurring in 1219 unique genes across 218 MA lines. We extracted the nucleotide and protein sequences of the affected genes using YeastMine (Balakrishnan et al. 2012). From the same database, we downloaded the locations of introns in these genes. The reference nucleotide sequence was then mutated in silico to represent the mutant sequence, which was then transcribed and translated using the seqinr package (Charif and Lobry 2007) in R (R Core Team 2019). Additionally, we analyzed VCF files to obtain a table of mutations in the ancestral line as compared to the yeast reference genome (version R64-2-1). In cases where the ancestor and reference strain differed for a mutated gene (126 genes), we separately computed the ancestral protein and used it for comparison to the MA lines. We wrote a script to produce protein variants in the format PROVEAN requires. From 1474 genic mutations, 1126 protein variants were computed (in 961 unique proteins). Two samples (lines 113 and 206) had no nonsynonymous mutations. When an MA line had more than one nonsynonymous mutation in a particular gene, all of these mutations were applied when altering the protein, and the protein was counted once in the number of mutant proteins. Out of 961 altered proteins, 126 already differed between the S288C reference genome and the laboratory ancestor, in which case the latter was used as the query sequence.
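The in silico mutation and translation step described above can be illustrated with a minimal Python sketch. This is not the authors' script (they used the seqinr package in R). The function names, the toy sequence, and the convention that positions are 0-based offsets into an already spliced coding sequence are assumptions made for illustration. The "K2E"-style notation for single amino acid substitutions is likewise assumed here; insertions, deletions, and premature stops are not handled.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_substitutions(cds, substitutions):
    """Apply (position, ref_base, alt_base) substitutions to a coding sequence.

    Positions are 0-based offsets into the spliced CDS (an assumption of this
    sketch). A mismatch between the stated reference base and the sequence
    raises an error, which guards against coordinate mistakes.
    """
    seq = list(cds)
    for pos, ref, alt in substitutions:
        if seq[pos] != ref:
            raise ValueError(f"expected {ref} at position {pos}, found {seq[pos]}")
        seq[pos] = alt
    return "".join(seq)


def substitution_variants(ancestral_protein, mutant_protein):
    """List single-residue differences as strings such as 'K2E' (1-based)."""
    if len(ancestral_protein) != len(mutant_protein):
        raise ValueError("protein length changed; not a simple substitution")
    return [
        f"{a}{i}{m}"
        for i, (a, m) in enumerate(zip(ancestral_protein, mutant_protein), start=1)
        if a != m
    ]


# Toy example: two nonsynonymous mutations in the same gene of one MA line.
ancestral_cds = "ATGAAAGGTTAA"                    # encodes M-K-G, then stop
mutant_cds = apply_substitutions(ancestral_cds, [(3, "A", "G"), (7, "G", "C")])
ancestral_protein = translate(ancestral_cds)      # "MKG"
mutant_protein = translate(mutant_cds)            # "MEA"
variants = substitution_variants(ancestral_protein, mutant_protein)
print(variants)                                   # ['K2E', 'G3A']
```

Both mutations are applied to the same sequence before translation, so the gene yields a single mutant protein, in line with the counting rule described above. Where the laboratory ancestor differs from the reference, the ancestral sequence would take the place of `ancestral_cds`.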

Sc2 We used the mutations reported in Liu and Zhang (2019; Data_S1.xlsx). Additionally, the authors supplied us with a table of mutations in their ancestral line relative to the S288C reference genome. We used the same method as described above for dataset Sc1. There were 1147 genic mutations, occurring in 968 unique genes, across 165 MA lines. From 1147 genic mutations, 877 protein variants were computed (in 754 unique proteins). Out of 754 altered proteins, 16 already differed between the S288C reference genome and the laboratory ancestor, in which case the latter was used as the query sequence.

Cr We received an annotated table of the mutations reported in Ness et al. (2015) as well as VCF files containing the mutations in their six ancestral lines compared to the reference genome. We downloaded an annotated table for all transcripts in the Chlamydomonas reference genome from Dicots PLAZA 4.0 (genome version 5.5; Van Bel et al. 2018) to identify mutations in coding sequences. Out of the original 6843 mutations, 3889 affected protein sequence, representing 1439 mutated proteins after combining mutations. We found that the majority of transcripts that were mutated during mutation accumulation already had existing variants in the ancestral strain, relative to the reference (table 1). Of the originally predicted 1439 protein variants, 1397 remained once ancestral variation had been considered (table 1). As in the other datasets, we used the ancestral protein as the query protein. We found two cases in the C. reinhardtii dataset where the reported reference nucleotide deviated from that found in the Dicots PLAZA 4.0 sequence; in each case, the differences between the two reference sequences were synonymous. This discrepancy was likely due to the two different reference genomes used (Ness et al. used v5.3; Van Bel et al. used v5.5). To test the accuracy of our sequence-mutating code, we mutated the coding sequence to the reference nucleotide given by the C. reinhardtii dataset and verified that this produced the reference transcript. We converted the protein variants into the format PROVEAN requires. In cases with alternative transcripts, we treated these as separate proteins in PROVEAN and then reported the minimum score given to any protein variant of a gene. This occurred in 42 unique cases, involving all genetic backgrounds. While the difference in scores between transcripts was generally small, we found two cases where the score for one affected transcript was below the default threshold of –2.5 while the other was above it, and six cases where the scores fell on opposite sides of zero. Six of the 1397 protein variants failed to receive a score from PROVEAN, likely because the changes to the protein were too large for alignment scores to be computed between the gathered clusters and the mutant protein; these variants were ignored in the analysis (they occurred in six different samples across five ancestral backgrounds).
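As a rough illustration of how transcript-level scores could be reduced to per-line predictors, the Python sketch below takes the minimum score per gene and then computes the three metrics compared in the abstract: the number of mutant proteins, the number scored as deleterious, and the sum of scores. The record layout, the line and gene names, and the scores are hypothetical. Treating a score equal to the threshold as deleterious is an assumption of this sketch, and unscored variants are dropped as described above, so they do not contribute to any of the counts here.

```python
from collections import defaultdict

DELETERIOUS_THRESHOLD = -2.5  # default PROVEAN threshold quoted in the text


def gene_scores(records):
    """Collapse transcript-level scores to one score per (line, gene).

    Each record is (line, gene, transcript, score). When a gene has several
    transcripts, the minimum score is kept. Records with score None
    (variants that received no score) are ignored.
    """
    best = {}
    for line, gene, _transcript, score in records:
        if score is None:
            continue
        key = (line, gene)
        best[key] = score if key not in best else min(best[key], score)
    return best


def line_metrics(best):
    """Per-line predictors: mutant protein count, deleterious count, score sum."""
    metrics = defaultdict(
        lambda: {"n_mutant": 0, "n_deleterious": 0, "score_sum": 0.0}
    )
    for (line, _gene), score in best.items():
        m = metrics[line]
        m["n_mutant"] += 1
        m["n_deleterious"] += int(score <= DELETERIOUS_THRESHOLD)
        m["score_sum"] += score
    return dict(metrics)


# Hypothetical records: geneA has two transcripts whose scores straddle -2.5.
records = [
    ("MA1", "geneA", "t1", -3.1),
    ("MA1", "geneA", "t2", -1.9),
    ("MA1", "geneB", "t1", 0.4),
    ("MA2", "geneC", "t1", None),  # unscored variant, ignored
]
best = gene_scores(records)
metrics = line_metrics(best)
print(metrics)
```

In this toy input, geneA is classified as deleterious because the lower of its two transcript scores falls below the threshold, which is the situation reported for two cases above.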

Usage notes

Details and usage notes can be found in Sandell_provean_project_README.txt.