A geographic history of human genetic ancestry
Data files
Mar 10, 2025 version files (5.34 GB total)
data.tar.gz: 5.34 GB
README.md: 10.46 KB
Abstract
Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. We present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We use this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatial-temporal context of genetic ancestry to describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.
https://doi.org/10.5061/dryad.p5hqbzkwz
Description of the data and file structure
Data and code for testing and fitting GAIA.
Files and variables
File: data.tar.gz
Description: A gzip compressed archive with the following contents:
The slim/ directory contains simulation scripts and outputs for testing GAIA with the SLiM software in continuous and discrete geographic space. Each subdirectory corresponds to a different representation of geographic space. It includes the following subdirectories and files:
continuous-space/
: Simulations on an abstract continuous plane
uniform-landscape/
: SLiM simulations with spatially homogeneous carrying capacity
gaussian-dispersal/
: Simulations with a Gaussian dispersal kernel
analysis/
performance.R, performance.sh
: R script and shell script used to run GAIA on each of the simulation tree sequences. They output ancestor-estimates.csv, rate-estimates.csv, and timings-gaia.txt.
ancestor-estimates.csv
: A csv file with the dispersal variance (col 1), simulation replicate (col 2), internal node index (col 3), internal node time (col 4), and error (col 5) in GAIA’s location estimate for the node.
rate-estimates.csv
: A csv file with the dispersal variance (col 1), simulation replicate (col 2), GAIA’s dispersal rate estimate (col 3), and the true rate for the simulation (col 4).
timings-gaia.txt
: Each row is the time (in seconds) GAIA took to process one simulation.
performance-wohns.py, performance-wohns.sh
: Python script and shell script used to run the Wohns estimator on each of the simulation tree sequences. They output ancestor-estimates-wohns.csv and timings-wohns.txt. These files have the same structure as the corresponding GAIA output files described above.
figs/
: R scripts for producing figure output that appears in the manuscript.
simulations/
run.slim
: Eidos code that implements the simulation.
slim.sh
: Bash script that invokes run.slim for different dispersal strengths.
trees/
tree-S##-R##.trees.tsz
: tszip compressed tree sequences from each simulation generated by run.slim. In the naming convention S##-R##, the first set of characters following the S is the variance of the dispersal kernel and the second set of characters following the R is the simulation replicate.
figs/
: R scripts for producing figure output that appears in the manuscript.
laplace-dispersal/
: simulations with a Laplace dispersal kernel. The directory structure and file contents are as above.
heterogeneous-landscape/
: SLiM simulations with spatially varying carrying capacity
analysis/
: R scripts and shell scripts used to run GAIA on each of the simulation tree sequences. See above for further description.
simulations/
run.slim, run-pareto.slim
: Eidos code used to implement the simulations. The first file uses a Gaussian dispersal kernel; the second file, a Pareto dispersal kernel. The first file saves tree sequences to ./trees/tree-##.trees.tsz and the second to ./trees-pareto/tree-##.trees.tsz, where the ## characters are replaced by the index of the simulation landscape.
slim.sh
: Shell script that invokes the above scripts to run the simulations on each of the simulated landscapes with spatially varying carrying capacity.
sim-landscape.R
: R script used to generate spatially varying carrying capacities under a Gaussian random field.
landscapes/
landscape-##.csv
: Each file contains a matrix of carrying capacities on a 100-by-100 spatial grid produced by sim-landscape.R.
figs/
: R scripts for producing figure output that appears in the manuscript.
discrete-space/
: Simulations on a spatial grid representation of the African and Eurasian landmasses
analysis/
ooa.R, ooa.sh
: R script and bash script to run GAIA on each simulation tree sequence. Results are saved to the mpr/ and flux/ directories.
true-flux.R
: R script to calculate true ancestry flux coefficients for each simulation.
flux/
ooa-north-flux-##.csv, ooa-south-flux-##.csv, ooa-both-flux-##.csv
: Each csv file contains the true ancestry flux out of Africa (col 1) and the estimated ancestry flux out of Africa (col 2). In the naming convention, the ## characters are replaced by the simulation replicate and the north, south, both infixes indicate which landscape the simulation was performed on.
mpr/
ooa-north-mpr-##.csv, ooa-south-mpr-##.csv, ooa-both-mpr-##.csv
: Each csv contains the node index (col 1), the true geographic state of the node (col 2), and the estimated geographic state of the node (col 3). In the naming convention, the ## characters are replaced by the simulation replicate and the north, south, both infixes indicate which landscape the simulation was performed on.
figs/
: R scripts for producing figure output that appears in the manuscript.
simulations/
landgrid-adjmat-ooa-north.csv, landgrid-adjmat-ooa-south.csv, landgrid-adjmat-ooa-both.csv
: Adjacency matrices that indicate if spatial grid cells are adjacent (1) or not (0). These three files are identical except for a few connections between Africa and Eurasia. The first only allows dispersal out of Africa through a northern route; the second, through a southern route; the third, through both routes.
run.slim, slim.sh
: The first file is an Eidos script that implements the simulation and the second is a bash script that invokes the simulation on the different spatial grid representations.
trees/
tree-ooa-north-##.trees.tsz, tree-ooa-south-##.trees.tsz, tree-ooa-both-##.trees.tsz
: tszip compressed tree sequences from each simulation. In the naming convention, the ## characters are replaced by the simulation replicate and the north, south, both infixes indicate which landscape the simulation was performed on.
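The naming conventions and column layouts described above for the discrete-space flux/ and mpr/ output can be combined into a small loader. The Python sketch below is illustrative only and is not part of the archive. It assumes that the ## placeholder is a run of digits and that the csv files have no header row; the column labels (true_flux, estimated_flux, and so on) are invented here from the column descriptions. It is worth checking one file by hand before relying on either assumption.

```python
import csv
import io
import re

# Column labels invented for illustration, following the descriptions above.
COLUMNS = {
    "flux": ["true_flux", "estimated_flux"],
    "mpr": ["node", "true_state", "estimated_state"],
}

# Assumes "##" stands for one or more digits (not confirmed by the README).
OOA_NAME = re.compile(r"^ooa-(north|south|both)-(flux|mpr)-(\d+)\.csv$")


def parse_ooa_name(filename):
    """Split e.g. 'ooa-north-flux-07.csv' into landscape, kind, replicate."""
    match = OOA_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename}")
    landscape, kind, replicate = match.groups()
    return {"landscape": landscape, "kind": kind, "replicate": int(replicate)}


def read_rows(handle, kind, has_header=False):
    """Read csv rows and label each column; values are kept as strings."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    names = COLUMNS[kind]
    return [dict(zip(names, row)) for row in reader if row]


# Hypothetical file name and contents, for demonstration only.
info = parse_ooa_name("ooa-north-flux-07.csv")
rows = read_rows(io.StringIO("0.25,0.31\n0.10,0.08\n"), info["kind"])
print(info, rows[0])
```

For a real file, the in-memory string would be replaced by an open file handle, and has_header=True can be passed if the first row turns out to hold column names.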
The empirical/ directory contains code for running GAIA on a subset of tree sequences from the HGDP dataset. It includes the following subdirectories and files:
analysis/
ancestry.R, ancestry-subsets.R
: R scripts to compute ancestry coefficients for the full data and for the data subsets using the computed parsimony statistics.
flux.R, flux-subsets.R
: R scripts to compute ancestry flux coefficients for the full data and for the data subsets using the computed parsimony statistics.
mpr.R, mpr-subsets.R
: R scripts to compute tree sequence parsimony statistics for the full data and for the data subsets.
geoarg.R
: R script to geo-reference nodes and edges in the tree sequence using the computed parsimony statistics.
results/
: Directory containing output from the above scripts.
mpr-chr18p.rds
: Result from mpr.R obtained by running GAIA on the chromosome 18 tree sequence for all samples.
mpr-chr18p-subset-###-.rds
: Results from mpr-subsets.R obtained by running GAIA on random subset ### of the samples from the chromosome 18 tree sequence, stored in the data/trees/chr18p-subset-###.trees file.
flux-chr18p.rds, flux-thru-time-chr18p.rds
: Results from flux.R. The first file contains ancestry flux coefficients obtained from the entire sample in a single temporal bin covering the most recent 20000 generations; the second file contains ancestry flux coefficients for the entire sample in temporal bins of 100 generations covering the most recent 20000 generations.
flux-subset-avg-chr18p.rds, flux-thru-time-subset-avg-chr18p.rds
: Results from flux-subsets.R. As above, except that the results are averaged over all data/trees/chr18p-subset-###.trees files.
ancestry-chr18p.rds, ancestry-thru-time-chr18p.rds
: Results from ancestry.R. The first file contains ancestry coefficients obtained from the entire sample in temporal bins of 500 generations back to the root of the tree sequence; the second file contains ancestry coefficients for continental subsets of samples in temporal bins of 100 generations covering the most recent 20000 generations.
ancestry-thru-time-subset-avg-chr18p.rds
: Results from ancestry-subsets.R. As for ancestry-thru-time-chr18p.rds above, but obtained by averaging over all data/trees/chr18p-subset-###.trees files.
georef-arg.csv
: Result from geoarg.R. The csv file contains a migration history for each edge in the tree sequence detailing the geographic state of the edge and the time it first entered that state.
data/
data.R, make-data.R
: R scripts needed to generate the tree sequence data used by analysis/. The output from these scripts is saved to data/trees/chr18p.trees and data/trees/subsets/chr18p-subset-###.trees.
land.gpkg, landgrid.gpkg
: The first file contains spatial polygon data for the African and Eurasian land masses and the second file contains a gridded representation of the same.
landgrid-adjmat.csv, landgrid-distmat.csv
: The first file is an adjacency matrix indicating whether the spatial grid cells in landgrid.gpkg are adjacent (1) or not (0); the second file is a distance matrix containing the length of the shortest path separating each pair of grid cells in landgrid.gpkg.
make-landgrid-distmat.R
: R script used to generate landgrid-distmat.csv.
TM_WORLD_BORDERS-0.1.gpkg
: Spatial polygon data with continent, country, region, etc. boundaries.
make-sample-region.R
: R script used to determine the continental region of each sample from its coordinates.
proj.R
: Useful coordinate reference system description strings for transformations among coordinate systems.
figs/
: R scripts for producing figure output that appears in the manuscript.
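To make the relationship between an adjacency matrix and a shortest-path distance matrix concrete, here is a small Python sketch on a toy grid of four cells. It is an illustration under stated assumptions, not the procedure in make-landgrid-distmat.R: it measures path length as the number of steps between adjacent cells, whereas the archived distance matrix may measure length in geographic units, and the toy matrix is invented.

```python
from collections import deque


def neighbours(adjmat):
    """Convert a 0/1 adjacency matrix into a list of neighbour lists."""
    return [[j for j, flag in enumerate(row) if flag == 1] for row in adjmat]


def hop_distances(adjmat):
    """All-pairs shortest path lengths, counted in steps, via breadth-first search.

    Pairs of cells with no connecting path are reported as None.
    """
    nbrs = neighbours(adjmat)
    n = len(adjmat)
    dist = [[None] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            cell = queue.popleft()
            for nxt in nbrs[cell]:
                if dist[source][nxt] is None:
                    dist[source][nxt] = dist[source][cell] + 1
                    queue.append(nxt)
    return dist


# Toy example (invented): four cells in a line, 0 - 1 - 2 - 3.
toy = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
distmat = hop_distances(toy)
print(distmat[0])
```

Removing a single connection from the toy matrix changes the resulting distances, which mirrors how the north, south, and both adjacency matrices in the simulations differ only in a few connections between Africa and Eurasia.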
Code/software
The GAIA R package from https://github.com/blueraleigh/gaia is required. Please consult its documentation to understand the objects saved to the .rds files described above.
Access information
Other publicly accessible locations of the data: