A geographic history of human genetic ancestry

Grundler, Michael; Terhorst, Jonathan; Bradburd, Gideon

Published Mar 10, 2025 on Dryad. https://doi.org/10.5061/dryad.p5hqbzkwz

Data files

Mar 10, 2025 version files (5.34 GB total)

- data.tar.gz (5.34 GB)
- README.md (10.46 KB)

Abstract

Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. We present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We use this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatial-temporal context of genetic ancestry to describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.

Description of the data and file structure

Data and code for testing and fitting GAIA.

Files and variables

File: data.tar.gz

Description: A gzip compressed archive with the following contents:

The slim/ directory contains simulation scripts and outputs for testing GAIA with the SLiM software in continuous and discrete geographic space. Each subdirectory corresponds to a different representation of geographic space. It includes the following subdirectories and files:

continuous-space/ : Simulations on an abstract continuous plane
- uniform-landscape/ : SLiM simulations with spatially homogeneous carrying capacity
  - gaussian-dispersal/ : simulations with a Gaussian dispersal kernel
    - analysis/
      - performance.R, performance.sh : R script and shell script used to run GAIA on each of the simulation tree sequences. They output ancestor-estimates.csv, rate-estimates.csv, and timings-gaia.txt.
        
        ancestor-estimates.csv : A csv file with the dispersal variance (col 1), simulation replicate (col 2), internal node index (col 3), internal node time (col 4), and the error in GAIA's location estimate for the node (col 5).
        
        rate-estimates.csv : A csv file with the dispersal variance (col 1), simulation replicate (col 2), GAIA's dispersal rate estimate (col 3), and the true rate for the simulation (col 4).
        
        timings-gaia.txt : Each row is the time (in seconds) GAIA took to complete its analysis of one simulation.
      - performance-wohns.py, performance-wohns.sh : Python script and shell script used to run the Wohns estimator on each of the simulation tree sequences. They output ancestor-estimates-wohns.csv and timings-wohns.txt, which have the same structure as the corresponding GAIA output files above.
      - figs/ : R scripts for producing figure output that appears in the manuscript.
    - simulations/
      - run.slim : Eidos code that implements the simulation.
      - slim.sh : Bash script that invokes run.slim for different dispersal strengths.
      - trees/
        
        tree-S##-R##.trees.tsz : tszip compressed tree sequences from each simulation generated by run.slim. In the naming convention S##-R##, the first set of characters following the S is the variance of the dispersal kernel and the second set of characters following the R is the simulation replicate.
      - figs/ : R scripts for producing figure output that appears in the manuscript.
  - laplace-dispersal/ : simulations with a Laplace dispersal kernel. The directory structure and file contents are as above.
- heterogeneous-landscape/ : SLiM simulations with spatially varying carrying capacity
  - analysis/ : R scripts and shell scripts used to run GAIA on each of the simulation tree sequences. See above for further description.
  - simulations/
    - run.slim, run-pareto.slim : Eidos code used to implement the simulations. The first file uses a Gaussian dispersal kernel; the second file, a Pareto dispersal kernel. The first file saves tree sequences to ./trees/tree-##.trees.tsz and the second to ./trees-pareto/tree-##.trees.tsz, where the ## characters are replaced by the index of the simulation landscape.
    - slim.sh : Shell script to invoke the above scripts to run the simulations on each of the simulated landscapes with spatially varying carrying capacity.
    - sim-landscape.R : R script used to generate spatially varying carrying capacities under a Gaussian random field.
    - landscapes/
      - landscape-##.csv : Each file contains a matrix of carrying capacities on a 100-by-100 spatial grid produced by sim-landscape.R.
    - figs/ : R scripts for producing figure output that appears in the manuscript.
discrete-space/ : Simulations on a spatial grid representation of the African and Eurasian landmasses
- analysis/
  - ooa.R, ooa.sh : R script and bash script to run GAIA on each simulation tree sequence. Results are saved to the mpr/ and flux/ directories.
  - true-flux.R : R script to calculate true ancestry flux coefficients for each simulation.
  - flux/
    - ooa-north-flux-##.csv, ooa-south-flux-##.csv, ooa-both-flux-##.csv : Each csv file contains the true ancestry flux out of Africa (col 1) and the estimated ancestry flux out of Africa (col 2). In the naming convention, the ## characters are replaced by the simulation replicate and the north, south, both infixes indicate which landscape the simulation was performed on.
  - mpr/
    - ooa-north-mpr-##.csv, ooa-south-mpr-##.csv, ooa-both-mpr-##.csv : Each csv contains the node index (col 1), the true geographic state of the node (col 2), and the estimated geographic state of the node (col 3). In the naming convention, the ## characters are replaced by the simulation replicate and the north, south, both infixes indicate which landscape the simulation was performed on.
  - figs/ : R scripts for producing figure output that appears in the manuscript.
- simulations/
  - landgrid-adjmat-ooa-north.csv, landgrid-adjmat-ooa-south.csv, landgrid-adjmat-ooa-both.csv : Adjacency matrices that indicate if spatial grid cells are adjacent (1) or not (0). These three files are identical except for a few connections between Africa and Eurasia. The first only allows dispersal out of Africa through a northern route; the second, through a southern route; the third, both routes.
  - run.slim, slim.sh : The first file is an Eidos script that implements the simulation and the second is a bash script that invokes the simulation on the different spatial grid representations.
  - trees/
    - tree-ooa-north-##.trees.tsz, tree-ooa-south-##.trees.tsz, tree-ooa-both-##.trees.tsz : tszip compressed tree sequences from each simulation. In the naming convention, the ## characters are replaced by the simulation replicate and the north, south, both infixes indicate which landscape the simulation was performed on.
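
The naming convention and column layouts described above can be turned into a small helper for browsing the continuous-space simulation outputs. The following is a minimal Python sketch, not part of the archive. The function names are hypothetical, the regular expression assumes that the variance and replicate fields in tree-S##-R##.trees.tsz contain no "-R" substring, and the presence of a header row in ancestor-estimates.csv is an assumption to check against the actual files. Loading a tree sequence additionally requires the third-party tszip package.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern for names such as tree-S##-R##.trees.tsz. The exact
# characters used in the two ## fields are not documented, so both are
# kept as strings.
TREE_NAME = re.compile(r"^tree-S(?P<variance>.+?)-R(?P<replicate>.+)\.trees\.tsz$")


def parse_tree_name(path):
    """Return (variance, replicate) as strings from a tree-S##-R## file name."""
    match = TREE_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return match.group("variance"), match.group("replicate")


def read_rows(csv_path, has_header=True):
    """Read a csv file into a list of rows (header handling is an assumption)."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
    return rows[1:] if has_header else rows


def mean_error_by_variance(rows):
    """Average the col 5 error for each col 1 dispersal variance.

    Each row follows the ancestor-estimates.csv layout:
    variance, replicate, node index, node time, error.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for variance, _replicate, _node, _time, error in rows:
        totals[variance][0] += float(error)
        totals[variance][1] += 1
    return {v: total / count for v, (total, count) in totals.items()}


def load_tree_sequence(path):
    """Decompress a .trees.tsz file into a tskit tree sequence."""
    import tszip  # third-party; imported lazily so the rest needs only the stdlib

    return tszip.decompress(path)
```

For example, `mean_error_by_variance(read_rows("ancestor-estimates.csv"))` would give the mean location error for each dispersal variance, provided the file matches the assumed layout.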

The empirical/ directory contains code for running GAIA on a subset of tree sequences from the HGDP dataset. It includes the following subdirectories and files:

analysis/
- ancestry.R, ancestry-subsets.R : R scripts to compute ancestry coefficients for the full data and for the data subsets using the computed parsimony statistics.
- flux.R, flux-subsets.R : R scripts to compute ancestry flux coefficients for the full data and for the data subsets using the computed parsimony statistics.
- mpr.R, mpr-subsets.R : R scripts to compute tree sequence parsimony statistics for the full data and for the data subsets.
- geoarg.R : R script to geo-reference nodes and edges in the tree sequence using the computed parsimony statistics.
- results/ : Directory containing output from above scripts
  - mpr-chr18p.rds : Result from mpr.R obtained by running GAIA on the chromosome 18 tree sequence for all samples.
  - mpr-chr18p-subset-###-.rds : Results from mpr-subsets.R obtained by running GAIA on random subset number ### of the samples from the chromosome 18 tree sequence, stored in the data/trees/chr18p-subset-###.trees file.
  - flux-chr18p.rds, flux-thru-time-chr18p.rds : Results from flux.R. The first file contains ancestry flux coefficients obtained from the entire sample in a single temporal bin covering the most recent 20000 generations; the second file contains ancestry flux coefficients for the entire sample in temporal bins of 100 generations covering the most recent 20000 generations.
  - flux-subset-avg-chr18p.rds, flux-thru-time-subset-avg-chr18p.rds : Results from flux-subsets.R. As above, except that the results are averaged over all data/trees/chr18p-subset-###.trees files.
  - ancestry-chr18p.rds, ancestry-thru-time-chr18p.rds : Results from ancestry.R. The first file contains ancestry coefficients obtained from the entire sample in temporal bins of 500 generations back to the root of the tree sequence; the second file contains ancestry coefficients for continental subsets of samples in temporal bins of 100 generations covering the most recent 20000 generations.
  - ancestry-thru-time-subset-avg-chr18p.rds : Results from ancestry-subsets.R. As for ancestry-thru-time-chr18p.rds above but obtained by averaging over all data/trees/chr18p-subset-###.trees files.
  - georef-arg.csv : Result from geoarg.R. The csv file contains a migration history for each edge in the tree sequence detailing the geographic state of the edge and the time it first entered that state.
data/
- data.R, make-data.R : R scripts needed to generate the tree sequence data used by analysis/. The output from these scripts is saved to data/trees/chr18p.trees and data/trees/subsets/chr18p-subset-###.trees.
- land.gpkg, landgrid.gpkg : The first file contains spatial polygon data for Africa and Eurasia land masses and the second file contains a gridded representation of the same.
- landgrid-adjmat.csv, landgrid-distmat.csv : The first file is an adjacency matrix indicating whether the spatial grid cells in landgrid.gpkg are adjacent (1) or not (0); the second file is a distance matrix containing the length of the shortest path separating each pair of grid cells in landgrid.gpkg.
- make-landgrid-distmat.R : R script used to generate landgrid-distmat.csv.
- TM_WORLD_BORDERS-0.1.gpkg : Spatial polygon data with continent, country, region, etc. boundaries.
- make-sample-region.R : R script used to determine the continental region of each sample from its coordinates.
- proj.R : Useful coordinate reference system description strings for transformations among coordinate systems.
figs/ : R scripts for producing figure output that appears in the manuscript.
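
To illustrate how a shortest-path distance matrix relates to an adjacency matrix like landgrid-adjmat.csv, here is a minimal Python sketch using breadth-first search. It is not the method in make-landgrid-distmat.R, which remains the authoritative source. In particular, it assumes every step between adjacent cells has length 1, whereas landgrid-distmat.csv may weight steps by geographic distance. The function name is hypothetical.

```python
from collections import deque


def hop_distances(adjacency):
    """All-pairs shortest-path lengths, counted in steps between adjacent cells.

    `adjacency` is a square list of lists of integer 0/1 flags (convert the
    csv text fields to int first). Unreachable pairs are reported as None.
    """
    n = len(adjacency)
    # Precompute the neighbour list of every cell.
    neighbours = [[j for j, flag in enumerate(row) if flag] for row in adjacency]
    result = []
    for source in range(n):
        dist = [None] * n
        dist[source] = 0
        queue = deque([source])
        # Breadth-first search visits cells in order of increasing step count.
        while queue:
            cell = queue.popleft()
            for nxt in neighbours[cell]:
                if dist[nxt] is None:
                    dist[nxt] = dist[cell] + 1
                    queue.append(nxt)
        result.append(dist)
    return result
```

Under the unit-step assumption, removing or adding a single connection between Africa and Eurasia in the adjacency matrix changes the distances for every pair of cells whose shortest path crosses that connection.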

Code/software

The GAIA R package from https://github.com/blueraleigh/gaia is required. Please consult its documentation to understand the objects saved to the .rds files described above.

Access information

Other publicly accessible locations of the data:

https://github.com/blueraleigh/gaia-paper