Data from: Analysis of statistical correlations between properties of adaptive walks in fitness landscapes
Cite this dataset
Reia, Sandro M.; Campos, Paulo R. A. (2020). Data from: Analysis of statistical correlations between properties of adaptive walks in fitness landscapes [Dataset]. Dryad. https://doi.org/10.5061/dryad.41ns1rn9r
Usage notes
This repository contains the fitness-landscape datasets and the code used to generate the adaptive walks presented in the paper "Analysis of statistical correlations between properties of adaptive walks in fitness landscapes".
There are three zipped folders in this repository, namely AdaptiveWalks_Codes, Empirical_Landscapes and NK_Samples. In the following, we describe the contents of each folder.
AdaptiveWalks_Codes:
1) The code AdaptiveWalk-random-hsp90.cpp generates random adaptive walks in the Hsp90 landscape. The empirical data for the Hsp90 fitness landscape can be obtained from C. Bank et al., PNAS 113, 14085 (2016).
The output consists of estimates of the mean walk length, predictability, mean path divergence and accessibility for each local optimum, together with the fitness values of those local optima.
The fitness values and connectivities of the sequences are embedded in the code. For the estimate of the mean path divergence, the Hamming distances between all pairs of sequences are read from the file hamming_distance_tres_colunas.txt.
To compile the code:
c++ -O3 AdaptiveWalk-random-hsp90.cpp -o AdaptiveWalk-random-hsp90 -lm -lgsl -lgslcblas
To run the code:
./script-random-hsp90
To change the number of adaptive walks, edit the run script.
2) The code AdaptiveWalk-prob-hsp90.cpp generates probabilistic adaptive walks in the Hsp90 landscape.
The instructions are the same as the ones for the random version.
3) The code AdaptiveWalk-hsp90-randomfreq.cpp generates random adaptive walks in the Hsp90 landscape to calculate the frequency of the mutational pathways produced by the dynamics. The same information is produced by the code in 1), but this code is more appropriate when better statistics are needed to evaluate the path frequencies, because it guarantees that a minimum number of walks terminates at every local optimum. The input is this minimum number of walks, which must be reached by the least-visited local optimum.
To compile the code:
c++ -O3 AdaptiveWalk-hsp90-randomfreq.cpp -o AdaptiveWalk-hsp90-randomfreq -lm -lgsl -lgslcblas
To run the code:
./script-random-hsp90-freq
4) The code AdaptiveWalk-hsp90-probfreq.cpp generates probabilistic adaptive walks in the Hsp90 landscape to calculate the frequency of the mutational pathways produced by the dynamics. The remaining information is exactly the same as in 3).
To compile the code:
c++ -O3 AdaptiveWalk-hsp90-probfreq.cpp -o AdaptiveWalk-hsp90-probfreq -lm -lgsl -lgslcblas
To run the code:
./script-probabilistic-hsp90-freq
5) The code AdaptiveWalk-GB1-randomfreq.cpp generates random adaptive walks in the GB1 fitness landscape to calculate the frequency of the mutational pathways produced by the dynamics, as well as the mean walk length, predictability, mean path divergence and accessibility for each local optimum. As in 3), the code guarantees that a minimum number of walks terminates at every local optimum; this number is set in the script. This is a bit tricky, because one of the local optima of the GB1 landscape is rarely visited by walks starting at the wild-type sequence. All the information about the GB1 landscape comes from processed data contained in the files elife_seq_number.txt, elife_sequence_degree.txt, elife_sequence_fitness.txt and elife_sequence_neighbors_correta.txt. The last file could not be uploaded (430 MB), but all of these files can be generated with the code in the folder Gb1_input_files (see the file readme.md), which processes the original data from Wu et al., eLife 5, e16965 (2016).
To compile the code:
c++ -O3 AdaptiveWalk-GB1-randomfreq.cpp -o AdaptiveWalk-GB1-randomfreq -lm -lgsl -lgslcblas
To run the code:
./script-GB1-randomfreq
6) The code AdaptiveWalk-GB1-probfreq.cpp generates probabilistic adaptive walks in the GB1 fitness landscape to calculate the frequency of the mutational pathways produced by the dynamics, as well as the mean walk length, predictability, mean path divergence and accessibility for each local optimum. The remaining information is exactly the same as in 5).
To compile the code:
c++ -O3 AdaptiveWalk-GB1-probfreq.cpp -o AdaptiveWalk-GB1-probfreq -lm -lgsl -lgslcblas
To run the code:
./script-GB1-probabilistic_freq
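The random and probabilistic adaptive walks generated by the codes above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the toy landscape, the function names and the seed are made up for illustration. In particular, the rule used here for the probabilistic walk (step probability proportional to the fitness gain) is an assumption, and the paper's definitions of predictability, path divergence and accessibility are not reproduced. The sketch only reports the mean walk length and the fraction of walks ending at each local optimum.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def fitter_neighbors(seq, fitness):
    """Sequences one mutation away from seq that have strictly higher fitness."""
    return [s for s in fitness
            if hamming(s, seq) == 1 and fitness[s] > fitness[seq]]

def adaptive_walk(start, fitness, rng, probabilistic=False):
    """Walk uphill from start until no fitter neighbor exists; return the path."""
    path = [start]
    while True:
        current = path[-1]
        candidates = fitter_neighbors(current, fitness)
        if not candidates:
            return path  # current is a local optimum
        if probabilistic:
            # Assumption: step probability proportional to the fitness gain.
            gains = [fitness[s] - fitness[current] for s in candidates]
            path.append(rng.choices(candidates, weights=gains, k=1)[0])
        else:
            # Random walk: every fitter neighbor is equally likely.
            path.append(rng.choice(candidates))

def summarize(start, fitness, n_walks, seed=0, probabilistic=False):
    """Mean walk length and fraction of walks ending at each local optimum."""
    rng = random.Random(seed)
    endpoints = Counter()
    total_steps = 0
    for _ in range(n_walks):
        path = adaptive_walk(start, fitness, rng, probabilistic)
        endpoints[path[-1]] += 1
        total_steps += len(path) - 1
    mean_length = total_steps / n_walks
    end_fraction = {opt: count / n_walks for opt, count in endpoints.items()}
    return mean_length, end_fraction

# Hypothetical 3-site binary landscape; "011" and "101" are its local optima.
toy_fitness = {"000": 1.0, "001": 1.1, "010": 1.2, "011": 1.3,
               "100": 0.9, "101": 1.4, "110": 0.8, "111": 1.25}

if __name__ == "__main__":
    print(summarize("000", toy_fitness, n_walks=1000))
    print(summarize("000", toy_fitness, n_walks=1000, probabilistic=True))
```

In this toy landscape every walk from "000" takes exactly two steps, so only the split between the two local optima differs between the random and the probabilistic rule.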
Empirical_Landscapes:
This folder contains the empirical landscapes used in our study. Each file has two columns: the first column gives the sequence and the second column gives the corresponding fitness value.
The file HSP90_fitness_landscape.txt contains the Hsp90 empirical landscape from C. Bank, S. Matuszewski, R. T. Hietpas, and J. D. Jensen, Proceedings of the National Academy of Sciences 113, 14085 (2016).
The code cleaning_file_HSP90.py reads the data from the HSP90 fitness landscape and creates the ring structure seen in Figure 1 of the manuscript. This figure can be generated by downloading the files HSP90_fitness_landscape.txt and cleaning_file_HSP90.py to the same folder and running the script cleaning_file_HSP90.py.
The file GB1_fitness_landscape.txt contains the GB1 empirical landscape from N. C. Wu, L. Dai, C. A. Olson, J. O. Lloyd-Smith, and R. Sun, eLife 5, e16965 (2016).
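A two-column landscape file of this kind could be loaded along the following lines. This is a sketch under assumptions: the columns are taken to be separated by whitespace, lines that do not parse as "sequence, number" are skipped, and the sample sequences and fitness values are invented placeholders, not data from the repository.

```python
import io

def read_landscape(handle):
    """Read 'sequence fitness' pairs from an open text file into a dict."""
    fitness = {}
    for line in handle:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        seq, value = parts
        try:
            fitness[seq] = float(value)
        except ValueError:
            continue  # skip a possible header line
    return fitness

# Invented sample standing in for a real file, to keep the sketch self-contained.
sample = io.StringIO("AAA 1.00\nAAB 0.85\nABB 1.10\n")
landscape = read_landscape(sample)
fittest = max(landscape, key=landscape.get)

if __name__ == "__main__":
    print(len(landscape), fittest)
```

For a real file, the same function can be called as `read_landscape(open("HSP90_fitness_landscape.txt"))`, provided the file follows the assumed layout.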
NK_Samples:
This folder contains the 10 samples of the NK landscapes used to generate the correlation matrix, in which N = 8 and K = 1, 2 or 3. Each file has two columns.
The first column gives the decimal representation of the binary sequence of length N. The second column gives the fitness value of the respective sequence.
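The decimal encoding makes it easy to recover sequences, neighbors and Hamming distances with bit operations, as sketched below. The helper names are hypothetical, the bit order (most significant bit first) is an assumption that should be checked against the files, and the toy fitness table uses 3 sites instead of N = 8 to keep it short.

```python
N = 8  # sequence length of the NK samples

def to_binary(index, n=N):
    """Binary string of length n for a decimal index (most significant bit first)."""
    return format(index, f"0{n}b")

def hamming(i, j):
    """Hamming distance between two sequences given as decimal indices."""
    return bin(i ^ j).count("1")

def neighbors(index, n=N):
    """Indices of the n sequences that differ from index at exactly one site."""
    return [index ^ (1 << k) for k in range(n)]

def local_optima(fitness, n=N):
    """Indices whose fitness exceeds that of all their one-mutant neighbors."""
    return [i for i in fitness
            if all(fitness[i] > fitness[j] for j in neighbors(i, n))]

# Hypothetical 3-site landscape, indexed 0..7 like the first column of the files.
toy = {0: 1.0, 1: 1.1, 2: 1.2, 3: 1.3, 4: 0.9, 5: 1.4, 6: 0.8, 7: 1.25}

if __name__ == "__main__":
    print(to_binary(5))          # sequence for index 5 when N = 8
    print(hamming(5, 3))         # distance between 101 and 011
    print(local_optima(toy, 3))  # local optima of the toy landscape
```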