Data from: The structure of an ancient genotype-phenotype map shaped the functional evolution of a protein family
Data files
May 23, 2025 version files (276.39 MB total):
- NovaSeq_DMS_datasets.tar.gz (187.42 MB)
- raw_read_counts.tar.gz (88.97 MB)
- README.md (5.05 KB)
Abstract
Mutation is more likely to produce some phenotypes than others, but the causal role of these production propensities in the evolution of phenotypic diversity remains unclear. There are two major challenges: it is difficult to separate the effect of the genotype-phenotype (GP) map from that of natural selection when analyzing natural diversity, and most extant phenotypes evolved long ago in species whose GP maps cannot be recovered. Using reconstructed ancestral transcription factors, we created libraries containing all possible amino acid combinations at historically variable sites in the proteins’ DNA binding interface (the genotypes) and measured their capacity to specifically bind DNA elements containing all possible combinations of nucleotides at historically variable sites (the phenotypes). The two ancestral GP maps were strongly anisotropic (the distribution of phenotypes encoded by genotypes is highly nonuniform) and heterogeneous (the phenotypes accessible around each genotype vary dramatically among genotypes), but the extent and direction of these properties differed dramatically between the maps. In both cases, these properties steered evolution toward the lineage-specific phenotypes that evolved during history. Our findings establish that ancient properties of the GP relationship were causal factors in the evolutionary process that produced present-day patterns of functional conservation and diversity.
This dataset contains the raw read counts and the estimated mean fluorescence for all protein genotypes across all 16 response elements for three experimental replicates.
We used fluorescence-activated cell sorting (FACS) to separate cells based on their GFP expression. We performed two rounds of sorting: an initial “enrichment sort” to enrich for GFP+ variants in the full libraries, and a second, higher resolution “binned sort” on the enriched libraries to generate quantitative fluorescence estimates for each variant. To normalize fluorescence to cell volume, GFP gates were drawn to have a slope of 1.5 on a log(FSC-A)-log(GFP) plot. We sorted 2.5×10^7 cells per library in the enrichment stage.
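As an illustration of the gating rule described above, a gate with slope 1.5 on log-log axes corresponds to a threshold of the form GFP = k × (FSC-A)^1.5. The sketch below assumes log(FSC-A) is on the x-axis and log(GFP) on the y-axis; the function name, the `intercept` parameter, and the numeric values are hypothetical and are not taken from the authors' sorting configuration.

```python
import math


def above_gate(fsc_a: float, gfp: float, intercept: float, slope: float = 1.5) -> bool:
    """Return True if a cell lies above a straight gate on a log-log plot.

    Assumes x = log10(FSC-A) and y = log10(GFP). The intercept is a
    hypothetical parameter that sets the gate height.
    """
    if fsc_a <= 0 or gfp <= 0:
        raise ValueError("FSC-A and GFP must be positive to take logarithms")
    return math.log10(gfp) > slope * math.log10(fsc_a) + intercept


# Illustrative values only: with intercept 0, a cell with FSC-A = 100
# must exceed GFP = 100 ** 1.5 = 1000 to fall above the gate.
bright = above_gate(fsc_a=100.0, gfp=2000.0, intercept=0.0)
dim = above_gate(fsc_a=100.0, gfp=500.0, intercept=0.0)
```

Because the threshold scales with FSC-A, a larger cell needs proportionally more GFP signal to land in the same gate, which is the sense in which the gate normalizes fluorescence to cell size.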
The binned sort was performed to yield three replicates per library. Binned sorting followed the enrichment sort protocol but used four GFP bins instead of two, with ~1.6×10^8 cells collected per replicate.
Sequencing libraries were constructed from plasmids extracted from the enrichment sort GFP– population and the four binned sort populations. Replicate 1 of the binned sort libraries was sequenced on a NextSeq High Output run. The remaining replicates were sequenced on a NovaSeq S1 run.
Mean fluorescence for each protein-DNA complex was estimated from the distribution of its reads across the four fluorescence bins.
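One common way to turn binned read counts into a fluorescence estimate is a cell-weighted average over bins. The sketch below shows that general idea only; it is not necessarily the estimator used by the authors (see the related publication and the code repository for the actual procedure). The function name and the inputs `reads_per_bin`, `cells_per_bin`, and `bin_mean_fluorescence` are assumptions introduced for illustration.

```python
from typing import Sequence


def estimate_mean_fluorescence(
    read_counts: Sequence[float],
    reads_per_bin: Sequence[float],
    cells_per_bin: Sequence[float],
    bin_mean_fluorescence: Sequence[float],
) -> float:
    """Cell-weighted mean fluorescence for one variant across sort bins.

    read_counts:            reads of this variant in each bin
    reads_per_bin:          total reads sequenced from each bin
    cells_per_bin:          number of cells sorted into each bin
    bin_mean_fluorescence:  representative fluorescence of each bin (hypothetical input)
    """
    # Estimated cells per bin: the variant's share of reads, scaled by cells sorted.
    cell_counts = [
        (reads / total) * cells if total > 0 else 0.0
        for reads, total, cells in zip(read_counts, reads_per_bin, cells_per_bin)
    ]
    total_cells = sum(cell_counts)
    if total_cells == 0:
        raise ValueError("variant has no reads in any bin")
    # Weighted average of bin fluorescence, weighted by estimated cell counts.
    weighted = sum(c * f for c, f in zip(cell_counts, bin_mean_fluorescence))
    return weighted / total_cells


# Synthetic example values, not taken from the dataset.
example = estimate_mean_fluorescence(
    read_counts=[10, 30, 40, 20],
    reads_per_bin=[1000, 1000, 1000, 1000],
    cells_per_bin=[1000, 1000, 1000, 1000],
    bin_mean_fluorescence=[1.0, 2.0, 3.0, 4.0],
)
```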
Corresponding author information
- Name: Joseph W. Thornton
- Affiliation: Department of Ecology and Evolution, and Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Email: joet1@uchicago.edu
Related publication
Herrera-Álvarez, S., Patton, J.E.J., Thornton, J.W. (ACCEPTED; May-2025) The Structure of an Ancient Genotype-Phenotype Map Shaped the Functional Evolution of a Protein Family. Nature Ecology and Evolution.
Files and variables
File: raw_read_counts.tar.gz
Description: The compressed directory raw_read_counts.tar.gz contains 5 directories and 908 csv files in total. Each file contains raw read counts for every protein-RE complex in the enrichment and binned sorts across three experimental replicates (includes counts from NextSeq and NovaSeq runs). See Supplementary Information “Details on replicate sorting, sequencing and processing” for further details.
When applicable, file names in folders binned_REP1, binned_REP2, binned_REP3, and binned_REP4 follow the pattern:
AncSR(01)_REBC(02)_BB(03)_REP(04)_AA_var_count.csv
- `(01)` denotes the DBD protein background:
  - AncSR1
  - AncSR2
- `(02)` denotes the Response Element Barcode (REBC). There are 16 response elements (REs) and therefore 16 REBCs.
- `(03)` denotes the Bin Barcode (BB). There are four fluorescence bins and therefore 4 BBs.
- `(04)` denotes the experimental replicate
When applicable, file names in folder Debulk follow the pattern:
AncSR(01)_REBC(02)_GFPneg_Debulk_AA_var_count.csv
- `(01)` denotes the DBD protein background:
  - AncSR1
  - AncSR2
- `(02)` denotes the Response Element Barcode (REBC). There are 16 response elements (REs) and therefore 16 REBCs.
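The two file name patterns above can be parsed with regular expressions to recover the protein background, REBC, bin, and replicate for each file. This is a minimal sketch that assumes each `(NN)` placeholder is replaced by an integer in the real file names (for example `AncSR1_REBC3_BB2_REP1_AA_var_count.csv`); that format is an assumption and should be checked against the extracted archive. The function name `parse_count_filename` is hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed concrete form of the binned-sort pattern: placeholders become integers.
BINNED_PATTERN = re.compile(
    r"^AncSR(?P<background>\d+)_REBC(?P<rebc>\d+)"
    r"_BB(?P<bin>\d+)_REP(?P<rep>\d+)_AA_var_count\.csv$"
)
# Assumed concrete form of the Debulk (enrichment sort, GFP-negative) pattern.
DEBULK_PATTERN = re.compile(
    r"^AncSR(?P<background>\d+)_REBC(?P<rebc>\d+)_GFPneg_Debulk_AA_var_count\.csv$"
)


def parse_count_filename(name: str) -> Optional[Dict[str, str]]:
    """Return the metadata encoded in a count file name, or None if it does not match."""
    match = BINNED_PATTERN.match(name)
    if match:
        return {"sort": "binned", **match.groupdict()}
    match = DEBULK_PATTERN.match(name)
    if match:
        # Debulk files have no bin or replicate number; mirror the README's identifiers.
        return {"sort": "debulk", **match.groupdict(), "bin": "GFPneg", "rep": "Debulk"}
    return None


binned_info = parse_count_filename("AncSR1_REBC3_BB2_REP1_AA_var_count.csv")
debulk_info = parse_count_filename("AncSR2_REBC16_GFPneg_Debulk_AA_var_count.csv")
```

Returning `None` for non-matching names makes it easy to skip unrelated files when walking the extracted directories.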
For each csv file, the following columns are included:
- `AA_var`: The amino acid genotype at the four RH sites
- `Count`: Number of reads
- `REBC`: Identifier of the protein background and REBC
- `BinBC`: Identifier of the BB (BB1-BB4). If file is from `Debulk` sort, the identifier is `GFPneg`
- `REP`: Identifier of the experimental replicate (REP1-REP4). If file is from `Debulk` sort, the identifier is `Debulk`
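Given the columns listed above, per-bin read counts for each variant can be tallied with the standard library. The sketch below reads a small in-memory table; the rows, including the genotype strings and the `REBC` values, are synthetic and only illustrate the column layout. The function name `tally_reads` is hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO, Tuple

# Synthetic rows that follow the documented column names; not real data.
SAMPLE = """AA_var,Count,REBC,BinBC,REP
EGKA,12,AncSR1_REBC1,BB1,REP1
EGKA,30,AncSR1_REBC1,BB2,REP1
GSKV,7,AncSR1_REBC1,BB1,REP1
"""


def tally_reads(handle: TextIO) -> Dict[Tuple[str, str, str], Dict[str, int]]:
    """Sum read counts per (AA_var, REBC, REP), split by bin barcode."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(handle):
        key = (row["AA_var"], row["REBC"], row["REP"])
        counts[key][row["BinBC"]] += int(row["Count"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(bins) for key, bins in counts.items()}


table = tally_reads(io.StringIO(SAMPLE))
```

To use this on a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` object.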
File: NovaSeq_DMS_datasets.tar.gz
Description: The compressed directory NovaSeq_DMS_datasets.tar.gz contains 5 csv files. Each file contains the estimated fluorescence of each protein-DNA complex, calculated from the read counts (includes counts from NextSeq and NovaSeq runs). Note: estimated fluorescence applies only to the binned sort replicates.
For each csv file, the following columns are included:
- `AA_var`: The amino acid genotype at the four RH sites
- `REBC`: Identifier of the protein background and REBC
- `Count_b1`: Number of reads in Bin 1
- `Count_b2`: Number of reads in Bin 2
- `Count_b3`: Number of reads in Bin 3
- `Count_b4`: Number of reads in Bin 4
- `Count_total`: Total read count across 4 bins
- `REP`: Identifier of the experimental replicate (REP1-REP4)
- `cellCount_b1`: Estimated number of cells with RH genotype in Bin 1
- `cellCount_b2`: Estimated number of cells with RH genotype in Bin 2
- `cellCount_b3`: Estimated number of cells with RH genotype in Bin 3
- `cellCount_b4`: Estimated number of cells with RH genotype in Bin 4
- `meanF`: Estimated mean fluorescence of protein-DNA complex
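A quick sanity check after loading one of these files is to confirm that `Count_total` equals the sum of the four per-bin counts, as the column descriptions above imply. The sketch below runs that check on synthetic rows (one deliberately inconsistent); only a subset of the documented columns is included, and the function name `find_total_mismatches` is hypothetical.

```python
import csv
import io
from typing import List, TextIO, Tuple

BIN_COLUMNS = ["Count_b1", "Count_b2", "Count_b3", "Count_b4"]

# Synthetic rows using documented column names; the second row is deliberately wrong.
SAMPLE = """AA_var,REBC,Count_b1,Count_b2,Count_b3,Count_b4,Count_total,REP,meanF
EGKA,AncSR1_REBC1,10,30,40,20,100,REP1,2.7
GSKV,AncSR1_REBC1,5,5,0,0,11,REP1,1.5
"""


def find_total_mismatches(handle: TextIO) -> List[Tuple[str, str, str]]:
    """Return (AA_var, REBC, REP) for rows where Count_total != sum of bin counts."""
    mismatches = []
    for row in csv.DictReader(handle):
        binned_sum = sum(float(row[col]) for col in BIN_COLUMNS)
        if abs(binned_sum - float(row["Count_total"])) > 1e-9:
            mismatches.append((row["AA_var"], row["REBC"], row["REP"]))
    return mismatches


bad_rows = find_total_mismatches(io.StringIO(SAMPLE))
```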
Code/Software
All code needed to process the data and reproduce the analyses is available at https://github.com/JoeThorntonLab/RH-RE_scanning
