Data from: Fitness landscapes of human microsatellites

Published Jun 30, 2025 on Dryad. https://doi.org/10.5061/dryad.sbcc2frg6

Data files

Jun 30, 2025 version files 133.99 MB

- abcrfModelSelection.R (5.61 KB)
- abcrfParameterEstimation.R (2.63 KB)
- dinucleotide_reference_table.tsv (132.25 MB)
- empirical_data_file.csv (259.36 KB)
- fortuna-msat.tar.gz (357.79 KB)
- raw_and_calibrated_genotypes.xlsx (1.10 MB)
- README.md (15.57 KB)

Abstract

Advances in DNA sequencing technology and computation now enable genome-wide scans for natural selection to be conducted on unprecedented scales. By examining patterns of sequence variation among individuals, biologists are identifying genes and variants that affect fitness. Despite this progress, most population genetic methods for characterizing selection assume that variants mutate in a simple manner and at a low rate. Because these assumptions are violated by repetitive sequences, selection remains uncharacterized for an appreciable percentage of the genome. To meet this challenge, we focus on microsatellites, repetitive variants that mutate orders of magnitude faster than single nucleotide variants, can harbor substantial variation, and are known to influence biological function in some cases. We introduce four general models of natural selection that are each characterized by just two parameters, are easily simulated, and are specifically designed for microsatellites. Using a random forests approach to approximate Bayesian computation, we fit these models to carefully chosen microsatellites genotyped in 200 humans from a diverse collection of eight populations. Altogether, we reconstruct detailed fitness landscapes for 43 microsatellites we classify as targets of selection. Microsatellite fitness surfaces are diverse, including a range of selection strengths, contributions from dominance, and variation in the number and size of optimal alleles. Microsatellites that are subject to selection include loci known to cause trinucleotide expansion disorders and modulate gene expression, as well as intergenic loci with no obvious function. The heterogeneity in fitness landscapes we report suggests that genome-scale analyses like those used to assess selection targeting single nucleotide variants run the risk of oversimplifying the evolutionary dynamics of microsatellites. 
Moreover, our fitness landscapes provide a valuable visualization of the selective dynamics navigated by microsatellites.


Description of repository contents and their use

This repository is associated with the article Haasl RJ and Payseur BA (2024) Fitness landscapes of human microsatellites. PLoS Genetics, 20:e1011524.

The empirical data set includes genotype data for 200 humans originating from the following eight 1000 Genomes populations:

CEU: CEPH/Utah (n=25)
CHB: Han Chinese in Beijing, China (n=27)
FIN: Finnish in Finland (n=30)
GIH: Gujarati Indian in Houston, TX (n=25)
LWK: Luhya in Webuye, Kenya (n=25)
MXL: Mexican ancestry in Los Angeles (n=18)
TSI: Toscani in Italy (n=25)
YRI: Yoruba in Ibadan (n=25)

Using a method described in Supplementary file S1 Text of the publication, we converted genotypes assayed as fragment lengths in units of base pairs (raw genotypes) into allele sizes -- i.e., the number of times the microsatellite motif is repeated (calibrated genotypes).

We simulated microsatellite datasets identical in structure to the empirical data set using a modified version of the forward-in-time population genetic simulator FORTUNA (Haasl 2022). The modified program generates microsatellite genotypes under models of demography, natural selection (or neutral evolution), and mutation whose details are specified in a parameters file read by the program. For the sake of distinction, we refer to the modified program as FORTUNA-msat.

We compared empirical genotypic data to simulated data using the inferential framework of approximate Bayesian computation (ABC), and performed both model selection and parameter estimation. More specifically, we used a variant of ABC named ABC Random Forests (ABC-RF), which statistically legitimizes model selection in the context of ABC and obviates the need for pre-selection of sufficient summary statistics. ABC-RF analyses were performed using two custom R scripts.

Files in this repository contain the raw and calibrated empirical genotypes, the source code for FORTUNA-msat, a parameters file that specifies the demographic model used in all analyses, and R scripts used to perform ABC-RF inference. Collectively, these files provide the necessary data and tools to recapitulate the analyses detailed in our publication.

Empirical data file

File Name: raw_and_calibrated_genotypes.xlsx

Description: This Excel file contains two worksheets named raw and calibrated that hold the raw and calibrated genotype data described above.

Worksheet raw_genotypes

Contains the raw fragment lengths for all microsatellites and sampled individuals.
First column:
- pop: the 1000 Genomes population from which the individual originates
Subsequent columns: every two columns contain the two raw allele calls (fragment lengths in base pairs) at a single locus for each of the individuals. The first three rows of these columns include the following information, in order:
- The ID used to name the microsatellite.
- The expected selection category to which the microsatellite belonged - e.g., neut.28 is a putatively neutral microsatellite while sel.26 is a microsatellite we treated as a candidate for selection.
- The group of microsatellites to which the microsatellite belonged - e.g., CA20 is a putatively neutral microsatellite with a CA motif and 20x reference allele size, while a gene identifier in this row such as ILR2B indicates that the microsatellite is intragenic.
Raw fragment lengths are recorded as NULL if genotyping was unsuccessful.

Worksheet calibrated_genotypes

Contains the calibrated allele sizes for all genotyped microsatellites and sampled individuals.
First column:
- pop: the 1000 Genomes population from which the individual originates
Subsequent columns: every four columns contain first the two raw allele calls (fragment lengths in base pairs) and then the two calibrated allele sizes at a single locus for each of the individuals. The first three rows of these columns include the following information, in order:
- The ID used to name the microsatellite.
- The expected selection category to which the microsatellite belonged - e.g., neut.28 is a putatively neutral microsatellite while sel.26 is a microsatellite we treated as a candidate for selection.
- The group of microsatellites to which the microsatellite belonged - e.g., CA20 is a putatively neutral microsatellite with a CA motif and 20x reference allele size, while a gene identifier in this row such as ILR2B indicates that the microsatellite is intragenic.
Allele sizes are recorded as NULL if genotyping was unsuccessful.
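
The column layout described above can be unpacked programmatically. The following is a minimal Python sketch, not part of the repository, which assumes the calibrated worksheet has been exported to CSV and read into a list of rows. The function name parse_calibrated_rows, the toy values, and the assumption that each header entry sits somewhere within its locus's four columns are all hypothetical.

```python
def parse_calibrated_rows(rows):
    """Split exported worksheet rows into per-locus records.

    Assumes: column 0 is `pop`; each locus then occupies four columns
    (raw allele 1, raw allele 2, calibrated allele 1, calibrated allele 2);
    rows 0-2 hold the locus ID, selection category, and group.
    """
    header, data = rows[:3], rows[3:]
    n_loci = (len(rows[0]) - 1) // 4
    loci = []
    for k in range(n_loci):
        start = 1 + 4 * k
        block = slice(start, start + 4)
        # Use the first non-empty header cell within this locus's four columns.
        meta = [next((c for c in row[block] if c), "") for row in header]
        genotypes = []
        for row in data:
            calibrated = row[start + 2:start + 4]
            # NULL marks unsuccessful genotyping; keep it as None.
            alleles = tuple(None if a == "NULL" else int(a) for a in calibrated)
            genotypes.append((row[0], alleles))
        loci.append({"id": meta[0], "category": meta[1], "group": meta[2],
                     "genotypes": genotypes})
    return loci

# Toy worksheet with one locus and two individuals (made-up values).
rows = [
    ["pop", "locusA", "", "", ""],
    ["", "neut.28", "", "", ""],
    ["", "CA20", "", "", ""],
    ["YRI", "151", "155", "19", "21"],
    ["CEU", "NULL", "NULL", "NULL", "NULL"],
]
loci = parse_calibrated_rows(rows)
print(loci[0]["id"], loci[0]["category"], loci[0]["genotypes"])
```

Keeping failed genotypes as None rather than dropping them preserves the one-row-per-individual structure, so population labels stay aligned with genotypes.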

Simulation program FORTUNA-msat

File: fortuna-msat.tar.gz

Description: This compressed tar file includes the source files for FORTUNA-msat and a parameters file that incorporates the demographic model used in all simulations.

Unpacking and compiling

In a Unix environment, unpack the files by running

tar -xvzf fortuna-msat.tar.gz

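Then, to compile, run the following command. Because the source file is C++ (.cc), substitute g++ for gcc if the link step fails.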

gcc fortuna-msat.cc -o fortuna

Preparing to run a simulation

Make sure that the compiled program fortuna is in the same directory as the parameters file parameters and the file named starting_allele_frequency_distributions_v2. The latter contains 20,000 randomly generated starting allele frequency distributions; each run of the program randomly selects one of them as its starting point.
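
A quick check before launching a simulation can catch a missing input. The following is a minimal Python sketch, not part of the repository; the file names come from the paragraph above, while the helper missing_inputs is hypothetical.

```python
from pathlib import Path

# File names taken from the text above; the helper itself is hypothetical.
REQUIRED = ("fortuna", "parameters", "starting_allele_frequency_distributions_v2")

def missing_inputs(run_dir="."):
    """Return the required files that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in REQUIRED if not (run_dir / name).is_file()]

if __name__ == "__main__":
    missing = missing_inputs(".")
    print("ready to run" if not missing else f"missing: {', '.join(missing)}")
```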

Execution of compiled program

In a Unix terminal, in the directory with the appropriate files (see previous section), run

./fortuna 1

Note that the integer passed as a command line argument will be appended to all results files for a single simulation run. By changing the integer, the user can effectively prevent overwriting of past simulation results.
As set up, the program randomly selects a model of selection and, if that model is not the neutral model, also randomly selects both the key allele size and the strength of selection -- the two parameters estimated for loci identified as targets of selection. A user wishing to change this behavior can modify the Metapopulation class constructor in metapopulation.h. For example, commented-out code in the constructor demonstrates how to direct the program to simulate only one model of selection rather than choosing a model at random.
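
Because each run is tagged by its integer argument, a reference table can be built up by launching many runs with distinct integers. The following is a minimal Python sketch, not part of the repository; the helper names fortuna_commands and run_batch are hypothetical, and the sketch assumes the compiled program is named fortuna and sits in the working directory.

```python
import subprocess

def fortuna_commands(run_ids, executable="./fortuna"):
    """Build one command per run; the integer tags that run's result files."""
    return [[executable, str(i)] for i in run_ids]

def run_batch(run_ids, workdir="."):
    """Run the simulations one after another inside workdir."""
    for cmd in fortuna_commands(run_ids):
        # check=True stops the batch if a run exits with an error.
        subprocess.run(cmd, cwd=workdir, check=True)

# Dry run: show the commands for runs 1-3 without executing anything.
for cmd in fortuna_commands(range(1, 4)):
    print(" ".join(cmd))
```

Calling run_batch(range(1, 101)) would then execute 100 runs in sequence, each writing results files with its own suffix.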

Data files and R scripts for ABC-RF implementation

DATA File: empirical_data_file.csv

Description: This file is included to (1) provide an example file of empirical data to run the two R scripts and (2) allow users to replicate the results presented in the paper. The file contains features (columns 1-408) and contextual information (columns 409-418) for all loci analyzed in the paper. Each row provides the data for a single locus. Column headers have the following meanings:

Features f1-f51: Observed frequencies of alleles sized 1-50 in population YRI, followed by the observed heterozygosity for YRI.
Features f52-f102: The same for population LWK.
Features f103-f153: The same for population GIH.
Features f154-f204: The same for population CHB.
Features f205-f255: The same for population MXL.
Features f256-f306: The same for population FIN.
Features f307-f357: The same for population TSI.
Features f358-f408: The same for population CEU.
Context id: Identifier for the locus
Context motif size: di, tri, or tetra
Context variationPool: "yes" if there was any variation in the full sample, "no" otherwise
Context vas: variance in allele size across the full sample (all populations)
Context mean: mean allele size across the full sample
Context heterozygosity: heterozygosity across the full sample
Context meanAF_less_meanNonAF: the result of mean allele size across the two African populations (YRI, LWK) minus the mean allele size across the six non-African populations
Context vasAF_less_vasNonAF: the result of variance in allele size across the two African populations (YRI, LWK) minus variance in allele size across the six non-African populations
Context hetAF_less_hetNonAF: the result of heterozygosity across the two African populations (YRI, LWK) minus the heterozygosity across the six non-African populations
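
The feature numbering above follows a regular pattern: eight blocks of 51 columns, one block per population. The following is a minimal Python sketch, not part of the repository, that decodes a feature header and builds one population's 51 features from calibrated genotypes. The function names are hypothetical. Heterozygosity is assumed here to be the fraction of genotyped individuals carrying two different allele sizes, which may differ from the definition used in the paper. Alleles larger than 50 repeats are counted in the denominator but have no frequency column.

```python
from collections import Counter

# Population order as listed above (f1-f51 is YRI, ..., f358-f408 is CEU).
POPS = ["YRI", "LWK", "GIH", "CHB", "MXL", "FIN", "TSI", "CEU"]
BLOCK = 51  # 50 allele-frequency features plus 1 heterozygosity per population

def describe_feature(name):
    """Map a header such as 'f52' to (population, meaning)."""
    idx = int(name[1:]) - 1  # zero-based feature index
    pop, within = POPS[idx // BLOCK], idx % BLOCK
    if within == BLOCK - 1:
        return pop, "heterozygosity"
    return pop, f"frequency of allele size {within + 1}"

def population_features(genotypes):
    """Build the 51 features for one population from (allele1, allele2) pairs.

    Assumption: heterozygosity is the fraction of genotyped individuals
    with two different allele sizes; failed genotypes (None) are skipped.
    """
    typed = [g for g in genotypes if None not in g]
    if not typed:
        raise ValueError("no successfully genotyped individuals")
    counts = Counter(a for g in typed for a in g)
    total = sum(counts.values())
    freqs = [counts.get(size, 0) / total for size in range(1, 51)]
    het = sum(a != b for a, b in typed) / len(typed)
    return freqs + [het]

print(describe_feature("f51"))  # last feature of the first block
print(describe_feature("f52"))  # first feature of the second block
feats = population_features([(20, 20), (20, 22), (None, None), (21, 22)])
print(len(feats), feats[19], feats[50])
```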

DATA File: dinucleotide_reference_table.tsv

Description: This file is included as an example file to aid the user in formatting the reference table for ABC-RF inference. As with the previously described data file, each row holds the information for a single locus; because this is a reference table, however, each locus is a simulated locus. More specifically, these simulated results (i.e., the reference table for ABC-RF inference) are drawn from >131,000 FORTUNA-msat simulations in which a dinucleotide locus (and hence a dinucleotide mutation rate) was specified. The columns in this file are as follows:

(column 1) mean_startDist: the mean allele size of the allele frequency distribution randomly chosen by FORTUNA-msat to begin the simulation.
(column 2) m: the model randomly selected and simulated by FORTUNA-msat.
(column 3) s: the value of the selection coefficient randomly chosen by FORTUNA-msat.
(column 4) keyallele: the value of the key allele (referred to as alpha in the paper)
(column 5) mutation_tuner: the small perturbation of the base mutation rate whose value was randomly chosen by FORTUNA-msat and used to account for uncertainty in microsatellite mutation rate
(column 6) startFreqLine: the line in the file of starting allele frequency distributions randomly chosen by FORTUNA-msat, whose mean allele size is given in the first column.
(columns 7-414) The simulated features of the simulated locus provided in exactly the same order as the first 408 columns of the empirical data file (see description of the DATA file immediately preceding this one).
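
The split between simulation parameters and features can be checked when preparing a reference table of your own. The following is a minimal Python sketch, not part of the repository; the helper split_reference_row and the toy values are hypothetical. With the real file, the first line is a header (the R scripts read it with header = T) and should be skipped before calling the helper.

```python
import csv
import io

PARAM_COLUMNS = ["mean_startDist", "m", "s", "keyallele",
                 "mutation_tuner", "startFreqLine"]
N_FEATURES = 408  # same features, in the same order, as the empirical file

def split_reference_row(row):
    """Separate simulation parameters (columns 1-6) from features (7-414)."""
    expected = len(PARAM_COLUMNS) + N_FEATURES
    if len(row) != expected:
        raise ValueError(f"expected {expected} columns, found {len(row)}")
    params = dict(zip(PARAM_COLUMNS, row[:6]))
    features = [float(x) for x in row[6:]]
    return params, features

# Toy one-row table (made-up values) standing in for the real .tsv file.
toy_row = ["21.4", "2", "0.013", "18", "1.02", "7345"] + ["0"] * N_FEATURES
toy_tsv = io.StringIO("\t".join(toy_row) + "\n")
for row in csv.reader(toy_tsv, delimiter="\t"):
    params, features = split_reference_row(row)
    print(params["m"], params["keyallele"], len(features))
```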

R SCRIPT file: abcrfModelSelection.R

Description: This file contains the code necessary to run ABC-RF model selection. It returns the posterior probabilities on all five possible models, including the neutral model.

Usage

Save the script in the same directory as the two necessary input files:
- Reference table: a tab-separated file (.tsv) that contains the results of a large number of FORTUNA-msat simulations, one result on each line. Use your own reference table or the example reference table dinucleotide_reference_table.tsv provided.
- Empirical data: a (.csv) file containing the calibrated genotypes for all populations OR a file containing simulated, "ground-truth" data. Use your own data or the data analyzed in the paper and found in the proper format in the provided DATA file empirical_data_file.csv. Ground truth data derived from simulations can be used as input here to create a confusion matrix (see Usage below), which is one way to assess the performance of a method like ABC-RF.
Open the R script and edit the following lines to reflect the actual names of your reference table file and empirical data file, unless you are using the example DATA files provided.

#Load Data
reftab <- read.table(file = "dinucleotide_reference_table.tsv", header = T)
obs <- read.csv("empirical_data_file.csv")

Run lines 1-94 of the file, after adding the actual names of your files, to perform hyperparameter estimation and generate posterior probabilities for each model, where model categories should be interpreted as:
- 0: neutral model
- 1: additive, single-optimum model of selection (ASO)
- 2: dominant, single-optimum model of selection (DSO)
- 3: additive, periodic-optima model of selection (APO)
- 4: dominant, periodic-optima model of selection (DPO)
If your input empirical .csv file contains multiple loci to compare to the reference table, you can use the results from running lines 1-94 -- namely, object final_model -- to generate heat maps as visualizations of confusion matrices. Lines 96-100 provide a simple visualization, while subsequent lines provide a more involved approach using the R package ggplot2.
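
The model codes and the confusion matrix idea can be illustrated independently of R. The following is a minimal Python sketch, not part of the repository and not a substitute for the R script. The labels are made up, the function names are hypothetical, and representing the posterior as a list of five probabilities ordered by model code is an assumption rather than the script's actual output format.

```python
# Model codes as defined in the list above.
MODELS = {0: "neutral", 1: "ASO", 2: "DSO", 3: "APO", 4: "DPO"}

def confusion_matrix(true_codes, predicted_codes):
    """Count how often each true model (row) is classified as each model (column)."""
    size = len(MODELS)
    matrix = [[0] * size for _ in range(size)]
    for t, p in zip(true_codes, predicted_codes):
        matrix[t][p] += 1
    return matrix

def best_model(posteriors):
    """Return the label of the model with the highest posterior probability."""
    code = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return MODELS[code]

# Made-up labels for six ground-truth loci, for illustration only.
truth = [0, 0, 1, 2, 3, 4]
predicted = [0, 1, 1, 2, 3, 3]
matrix = confusion_matrix(truth, predicted)
for code, row in zip(MODELS, matrix):
    print(f"{MODELS[code]:>7}", row)
print(best_model([0.05, 0.10, 0.70, 0.10, 0.05]))
```

Off-diagonal counts show which models are mistaken for one another; a perfect classifier would place every count on the diagonal.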

R SCRIPT file: abcrfParameterEstimation.R

Description: This file contains the code necessary to estimate the key allele size and selection strength for a microsatellite of interest.

Usage

Save the script in the same directory as the two necessary input files:
- Reference table: a tab-separated file (.tsv) that contains the results of a large number of FORTUNA-msat simulations, one result on each line. Use your own reference table or the example reference table dinucleotide_reference_table.tsv provided.
- Empirical data: a (.csv) file containing the calibrated genotypes and heterozygosities for all populations OR a file containing simulated, "ground-truth" data. Use your own data or the data analyzed in the paper and found in the proper format in the provided DATA file empirical_data_file.csv. Ground truth data derived from simulations can be used as input here to assess the performance of the ABC-RF method.
Open the R script and edit the following lines to reflect the names of your actual files, unless you are using the two example DATA files provided:

reftable <- read.table(file = "dinucleotide_reference_table.tsv", header  = T)
obs <- read.csv("empirical_data_file.csv")

Then edit these lines

moi = 1 # model of interest
loi = "PG06_00112_tbp" # locus of interest
poi.a = params$keyallele # parameter of interest (s or keyallele)

by choosing 1, 2, 3, or 4 for the model of interest; the string that identifies the locus of interest, as written in the id column of the empirical data file; and the parameter of interest, either params$keyallele or params$s, for key allele size and selection strength, respectively.

After the inference process completes, the object pred.x contains the details of the estimate.
Running lines 53-54 produces a graph of the importance of the most informative summary statistics to the estimate.
Running lines 56-62 generates a graph of the posterior density of the estimate.
Running lines 64-66 writes a text-file report of the results for the locus of interest.
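
The three settings above (model, locus, and parameter of interest) amount to choosing which simulations to learn from, which value to predict, and which empirical row to predict for. The following is a minimal Python sketch of that selection step only, not part of the repository. It assumes, without confirmation from the R script, that estimation uses only the simulations run under the model of interest. The function names and toy rows are hypothetical, and no random forest is fitted here.

```python
def training_set(reference_rows, model_of_interest, parameter):
    """Return (targets, features) for simulations run under one model."""
    if model_of_interest not in (1, 2, 3, 4):
        raise ValueError("model of interest must be 1, 2, 3, or 4")
    if parameter not in ("s", "keyallele"):
        raise ValueError("parameter of interest must be 's' or 'keyallele'")
    kept = [r for r in reference_rows if int(r["m"]) == model_of_interest]
    targets = [float(r[parameter]) for r in kept]
    features = [r["features"] for r in kept]
    return targets, features

def observed_features(observed_rows, locus_id):
    """Return the feature list of the empirical row whose id matches."""
    matches = [r for r in observed_rows if r["id"] == locus_id]
    if len(matches) != 1:
        raise ValueError(f"expected one row for {locus_id!r}, found {len(matches)}")
    return matches[0]["features"]

# Toy rows with made-up values and only three features each.
reference = [
    {"m": "1", "s": "0.010", "keyallele": "18", "features": [0.1, 0.2, 0.3]},
    {"m": "2", "s": "0.020", "keyallele": "25", "features": [0.3, 0.2, 0.1]},
    {"m": "1", "s": "0.005", "keyallele": "21", "features": [0.2, 0.2, 0.2]},
]
observed = [{"id": "PG06_00112_tbp", "features": [0.15, 0.2, 0.25]}]

targets, features = training_set(reference, 1, "keyallele")
print(targets, len(features))
print(observed_features(observed, "PG06_00112_tbp"))
```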

Human subjects data

We provide microsatellite genotype data for a subset of individuals in the 1000 Genomes Project, which were derived from DNA ordered from the Coriell NHGRI Sample Repository. The DNA we worked with was only associated with the 1000 Genomes Project ID of the individual from whom the DNA was derived.

Data from the 1000 Genomes Project are fully de-identified. Each sample is referenced only by a non-identifiable sample ID (e.g., NA17171), and no personal, clinical, or geolocation data are associated. Only minimal, high-level population metadata is included (e.g., CEU, YRI). Furthermore, the 1000 Genomes Project was reviewed and approved by appropriate IRBs and ethics boards, with explicit measures to protect participant privacy and reduce re-identification risk.

In addition, 1000 Genomes Project data are released into the public domain and are explicitly available without restriction or requirement for attribution, in alignment with the Fort Lauderdale and Toronto principles. This satisfies Dryad’s requirement for CC0-licensed data. As per the International Genome Sample Resource (IGSR), where 1000 Genomes Project data are stored, the data can be reused freely in any context without conditions.