
Learning protein fitness models from evolutionary and assay-labeled data

Cite this dataset

Hsu, Chloe; Nisonoff, Hunter; Fannjiang, Clara; Listgarten, Jennifer (2021). Learning protein fitness models from evolutionary and assay-labeled data [Dataset]. Dryad.


Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one density feature from modelling the evolutionary data. Within this approach, we find that a variational autoencoder-based density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
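The feature construction described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names, the 20-letter alphabet ordering, and the example density score are assumptions, and the density score is taken as given from whatever evolutionary model is used.

```python
# Assumed alphabet ordering; any fixed ordering of the 20 amino acids works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_site_features(sequence):
    """Return a flat 0/1 vector with one block of 20 entries per site."""
    features = [0.0] * (len(sequence) * len(AMINO_ACIDS))
    for pos, aa in enumerate(sequence):
        # Characters outside the alphabet (e.g. 'X') leave their block all zeros.
        if aa in AA_INDEX:
            features[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return features


def augmented_features(sequence, density_score):
    """Append one evolutionary density score to the site-specific features."""
    return one_hot_site_features(sequence) + [float(density_score)]


# Hypothetical 3-residue sequence and a made-up density score.
x = augmented_features("MKV", density_score=-12.5)
print(len(x))  # 3 sites * 20 amino acids + 1 density feature
```

Feature vectors built this way, one per labeled variant, could then be passed to any ridge regression implementation.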


Multiple sequence alignments. We use the MSAs provided by EVmutation [1] whenever possible. The only exception is the green fluorescent protein (GFP), for which we follow the same procedure as EVmutation to gather sequences using the profile HMM homology search tool jackhmmer [2]. We set the bit score threshold for the jackhmmer search using the same criterion as EVmutation. In particular, for GFP, we started at 0.5 bits/residue and then lowered the threshold to 0.1 bits/residue to meet the sequence number requirement (redundancy-reduced number of sequences >= 10L, where L is the length of the aligned region).
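The sequence number requirement can be expressed as a small check. The sketch below is an assumption-laden illustration: the function names are hypothetical, the intermediate threshold 0.3 and all sequence counts are invented for the example (they are not the GFP values), and the real EVmutation criterion may involve more than this single test.

```python
def meets_sequence_requirement(n_sequences, aligned_length, factor=10):
    """True if the redundancy-reduced sequence count is at least factor * L."""
    return n_sequences >= factor * aligned_length


def choose_bitscore_threshold(count_by_threshold, aligned_length):
    """Return the strictest bits/residue threshold that meets the requirement.

    count_by_threshold maps a threshold (bits/residue) to the
    redundancy-reduced number of sequences found at that threshold.
    Returns None if no threshold yields enough sequences.
    """
    for threshold in sorted(count_by_threshold, reverse=True):
        if meets_sequence_requirement(count_by_threshold[threshold], aligned_length):
            return threshold
    return None


# Invented counts for an aligned region of length L = 100 (needs >= 1000).
example_counts = {0.5: 400, 0.3: 800, 0.1: 1500}
chosen = choose_bitscore_threshold(example_counts, aligned_length=100)
print(chosen)
```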

Mutation effect data sets. The mutation effect data sets here are a processed version of the data sets collected by Hopf et al. [1]. Hopf et al. identified a list of mutation effect data sets generated by mutagenesis experiments of entire proteins, protein domains, and RNA molecules. We exclude the data sets for RNA molecules and influenza virus sequences, as well as data sets that contain fewer than 100 entries, so that train/test splits remain meaningful with at least 20 examples in the test data. This leaves us with 18 data sets from EVmutation. Following the convention in EVmutation [1] and DeepSequence [3], we exclude sequences with mutations at positions that have more than 30% gaps in the MSAs, to focus on regions with sufficient evolutionary data. On most data sets, this excludes less than 10% of the data, although for a few proteins such as GFP it affects as much as half of the positions. For example, on GFP, out of 237 positions, only positions 15-150 pass the criterion of less than 30% gaps in the MSA. Coincidentally, the selected region (positions 15-150) covers the 81-amino-acid region studied by Biswas et al. [4].
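The gap-based filter can be sketched as below. This is a minimal illustration rather than the processing code used for the data set: the function names and toy alignment are hypothetical, and it assumes that alignment columns map one-to-one onto wild-type positions and that "-" and "." mark gaps.

```python
def gap_fraction_per_column(aligned_sequences, gap_chars="-."):
    """Fraction of sequences with a gap character at each alignment column."""
    n_seqs = len(aligned_sequences)
    length = len(aligned_sequences[0])
    fractions = []
    for col in range(length):
        n_gaps = sum(1 for seq in aligned_sequences if seq[col] in gap_chars)
        fractions.append(n_gaps / n_seqs)
    return fractions


def low_gap_positions(aligned_sequences, max_gap_fraction=0.3):
    """0-based columns whose gap fraction does not exceed the cutoff."""
    fractions = gap_fraction_per_column(aligned_sequences)
    return {col for col, frac in enumerate(fractions) if frac <= max_gap_fraction}


def keep_variant(variant, wild_type, allowed_positions):
    """Keep a variant only if every mutated position is an allowed column."""
    mutated = [i for i, (a, b) in enumerate(zip(variant, wild_type)) if a != b]
    return all(i in allowed_positions for i in mutated)


# Toy alignment: column 2 has 50% gaps and is therefore excluded.
toy_msa = ["MKVL", "MK-L", "M--L", "MKVL"]
allowed = low_gap_positions(toy_msa)
print(sorted(allowed))
print(keep_variant("MRVL", "MKVL", allowed))  # mutation at column 1
print(keep_variant("MKAL", "MKVL", allowed))  # mutation at column 2
```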

[1] Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nature Biotechnology 35, 128–135 (2017).

[2] Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).

[3] Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nature Methods 15, 816–822 (2018).

[4] Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nature Methods 18, 389–396 (2021).

Usage notes

Mutation effect data sets. In the "processed_data/" directory, each sub-directory contains a wild-type sequence ("wt.fasta") and a list of mutation effect data ("data.csv").
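One way to read these two files with the Python standard library is sketched below. The helper names are hypothetical, and because the column names of "data.csv" are not stated here, the example uses in-memory stand-ins with placeholder columns ("seq", "score") rather than the real files.

```python
import csv
import io


def read_first_fasta_sequence(handle):
    """Return (header, sequence) for the first record in a FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)


def read_csv_rows(handle):
    """Return the CSV rows as a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


# In-memory stand-ins; the contents and column names are made up.
fasta_text = ">wt\nMKV\nLA\n"
csv_text = "seq,score\nMKVLA,0.0\nMRVLA,-1.2\n"

header, wt = read_first_fasta_sequence(io.StringIO(fasta_text))
rows = read_csv_rows(io.StringIO(csv_text))
print(header, wt, len(rows))
```

With the real data, the same functions would be called on handles opened on the "wt.fasta" and "data.csv" files of one sub-directory, and the actual column names can be inspected from the keys of the first row.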

Model performance comparison. The file "all_results.csv" contains the performance of each method over varying random seeds, training data sizes, and data sets.
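Since results are reported per random seed, a common first step is to average over seeds for each combination of method, data set, and training size. The sketch below shows one way to do this; the column names ("method", "dataset", "n_train", "seed", "score") and all values are assumptions for illustration and may not match the actual header of "all_results.csv".

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_over_seeds(rows, group_keys, value_key):
    """Average value_key over all rows that share the same group_keys values."""
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        grouped[key].append(float(row[value_key]))
    return {key: mean(values) for key, values in grouped.items()}


# Made-up rows with assumed column names.
example_csv = (
    "method,dataset,n_train,seed,score\n"
    "A,toy,48,0,0.25\n"
    "A,toy,48,1,0.75\n"
    "B,toy,48,0,0.5\n"
)
rows = list(csv.DictReader(io.StringIO(example_csv)))
summary = mean_over_seeds(rows, ["method", "dataset", "n_train"], "score")
for key, value in sorted(summary.items()):
    print(key, value)
```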

Multiple sequence alignments. In the "alignments/" directory, each a2m file is a multiple sequence alignment.
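A minimal reader for such files is sketched below. It assumes the usual A2M convention, in which records are FASTA-like, uppercase letters and "-" are match columns, and lowercase letters and "." are insert states; the function names and the toy alignment are hypothetical.

```python
import io


def read_a2m(handle):
    """Return a list of (name, aligned_sequence) records from an A2M file."""
    records, name, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def match_columns_only(aligned_sequence):
    """Drop insert states (lowercase letters and '.') so rows share one length."""
    return "".join(c for c in aligned_sequence if not (c.islower() or c == "."))


# Toy alignment; contents are made up.
toy_a2m = ">target\nMK.VL\n>hit1\nM-aVL\n>hit2\nMK.-L\n"
records = read_a2m(io.StringIO(toy_a2m))
matched = [match_columns_only(seq) for _, seq in records]
print(matched)
```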

Potts model parameters. In the "coupling_models/" directory, each ".model_params" file contains the Potts model parameters for a protein family in the plmc [1] format.

"Evo-tuned" eUniRep weights. Each sub-directory in “unirep_weights/” contains numpy (.npy) files for eUniRep parameters compatible with UniRep open source code [2].




National Institute of Diabetes and Digestive and Kidney Diseases, Award: T32LM012417

National Science Foundation, Award: DGE 2146752

Lawrence Livermore National Laboratory, Award: SCW1710

Chan Zuckerberg Initiative (United States)