Neandertal ancestry through time: Insights from genomes of ancient and present-day humans
Data files
Nov 26, 2024 version files 1.20 GB
- ALL_called_ancestry_segments.csv.zip (244.15 MB)
- ALL_NEA_bins_PP_AA_map.csv.zip (11.51 MB)
- ALL_NEA_bins_PP_Shared_map.csv.zip (27.70 MB)
- Ancestry_Covariance_Data_Shared_map.zip (3.49 MB)
- Meta_Data_individuals.csv (494.02 KB)
- Neandertal_segments_matching_references_Shared_map.csv.zip (41.43 MB)
- non_callable_bins_AA_map.csv (250.96 KB)
- non_callable_bins_Shared_map.csv (542.08 KB)
- non_callable_regions_in_physical_positions.csv (33.30 KB)
- README.md (22.08 KB)
- ref_archaicadmixtureX_hs37mMask35to99.csv.xz (39.15 MB)
- SNP_files_Shared_map.zip (830.73 MB)
Abstract
Gene flow from Neandertals has shaped the landscape of genetic and phenotypic variation in modern humans. We identify the location and size of introgressed Neandertal ancestry segments in more than 300 genomes spanning the last 50,000 years. We study how Neandertal ancestry is shared among individuals to infer the time and duration of the Neandertal gene flow. We find that the correlation of Neandertal segment locations across individuals and their divergence to sequenced Neandertals both support a model of a single major Neandertal gene flow. Our catalog of introgressed segments through time confirms that most natural selection, both positive and negative, on Neandertal ancestry variants occurred immediately after the gene flow, and provides new insights into how contact with Neandertals shaped human origins and adaptation.
https://doi.org/10.5061/dryad.zw3r228gg
In this study, we used 59 ancient and 2118 present-day modern human genomes. We called segments in the genome that are of Neandertal and Denisovan ancestry. This data repository contains all the files used for the analyses in the study. For a detailed description of how the data were generated, see the supplementary section of the study.
Description of the data and file structure
This data repository contains 11 files in total. Most files are inputs to, or contain output from, admixfrog runs. The genetic distances given in these files come from one of two empirical genetic maps: the Shared genetic map (files based on it do not contain data on the X chromosome) or the African-American map (this genetic map is specific to the African hot spot and not very precise for Eurasian populations, but it contains the X chromosome). Four files are not compressed. These are:
- Meta_Data_individuals.csv: a csv file containing all meta information on the individuals. The columns are as follows:
  - sample_name: ID of the individual.
  - ave_cont: the average contamination of the sequenced genomes estimated by admixfrog on archaic admixture sites (autosomes only). This is only important for ancient individuals.
  - ave_cont_error: the error in the estimate of contamination.
  - median cov: the median coverage throughout the sequenced genome (coverage = how often, on average, a site in the genome is read by the sequencer) on archaic admixture sites (autosomes only). This is only important for ancient individuals.
  - n_sites_cov: how many of the archaic admixture sites are present in the genome. This is only important for ancient individuals.
  - pop: population names from the SGDP or 1000 Genomes project. Ancient individuals are given population names according to their sample location.
  - superpopulation: same as pop but for continental populations.
  - Sex: biological sex of the individual.
  - Latitude: sample location coordinate. 1000 Genomes project coordinates are not always the sample location but the place of recent ancestry; e.g. the coordinates for Gujarati Indians in Houston are not those of Houston but the rough coordinates of the region this population originated from.
  - Longitude: the other sample location coordinate; otherwise the same as Latitude.
  - time: indicates whether the individual is ancient or present-day.
  - ave_cov_array: the average coverage on the ascertained sites. This is only important for ancient individuals.
  - Data_source: specifies whether the data was captured (and on which array), shotgun sequenced, or whether genotypes were used.
  - ML_BP_Mean: the mean of the most likely date the individual was alive.
  - estimate_all_f4r_NEA: the amount of Neandertal ancestry in percent estimated using the f4 ratio test on archaic admixture sites (autosomes only) (point estimate).
  - estimate_all_admixfrog_DEN: the amount of Denisovan ancestry in percent estimated using admixfrog on archaic admixture sites (autosomes only).
  - estimate_all_admixfrog_NEA: the amount of Neandertal ancestry in percent estimated using admixfrog on archaic admixture sites (autosomes only).
  - lower_all_admixfrog_DEN: the lower value of the amount of Denisovan ancestry in percent estimated using admixfrog on archaic admixture sites (autosomes only).
  - lower_all_admixfrog_NEA: the lower value of the amount of Neandertal ancestry in percent estimated using admixfrog on archaic admixture sites (autosomes only).
  - upper_all_admixfrog_DEN: the upper value of the amount of Denisovan ancestry in percent estimated using admixfrog on archaic admixture sites (autosomes only).
  - upper_all_admixfrog_NEA: the upper value of the amount of Neandertal ancestry in percent estimated using admixfrog on archaic admixture sites (autosomes only).
  - ML_BP_Lower: the lower value of the most likely date the individual was alive.
  - ML_BP_Higher: the higher value of the most likely date the individual was alive.
  - estimate_deam_f4r_NEA: the amount of Neandertal ancestry in percent estimated using the f4 ratio test only on archaic admixture sites for sequences showing signs of deamination (autosomes only) (point estimate). This is only important for ancient individuals.
  - estimate_all_f4r_gtLH_NEA: the amount of Neandertal ancestry in percent estimated using the f4 ratio test on contamination-corrected genotype likelihoods from admixfrog on archaic admixture sites (autosomes only) (point estimate). This is only important for ancient individuals.
  - Cov: overall coverage (coverage = how often, on average, a site in the genome is read by the sequencer) of the individual's genome. Estimated for ancient individuals on archaic admixture sites (autosomes only). Taken from the literature for present-day individuals.
  - DataType: type of sequencing, either shotgun or captured.
  - x_chrom_captured: logical value indicating whether the X chromosome is available.
  - estimate_all_f4r_NEA_1240k: the amount of Neandertal ancestry in percent estimated using the f4 ratio test on 1240k capture array sites (autosomes only) (point estimate).
  - estimate_all_admixfrog_DEN_1240k: the amount of Denisovan ancestry in percent estimated using admixfrog on 1240k capture array sites (autosomes only) (point estimate).
  - estimate_all_admixfrog_NEA_1240k: the amount of Neandertal ancestry in percent estimated using admixfrog on 1240k capture array sites (autosomes only) (point estimate).
  - estimate_deam_f4r_NEA_1240k: the amount of Neandertal ancestry in percent estimated using the f4 ratio test on 1240k capture array sites and only for sequences showing signs of deamination (autosomes only) (point estimate). This is only important for ancient individuals.
  - estimate_all_f4r_gtLH_NEA_1240k: the amount of Neandertal ancestry in percent estimated using the f4 ratio test on contamination-corrected genotype likelihoods from admixfrog on 1240k capture array sites (autosomes only) (point estimate). This is only important for ancient individuals.
  - prop_callable: the proportion of sites on the archaic admixture array that are covered (autosomes only).
  - population_cluster: the name of the population cluster, as identified in the study, that the individual is part of.
  - superpopulation_cluster: the name of the continental cluster, as identified in the study, that the individual is part of.
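As a minimal sketch of how the metadata table might be queried, the Python snippet below parses a csv with some of the columns described above and keeps ancient individuals above a coverage threshold. The three toy rows, the label `ancient` in the `time` column, and the threshold are hypothetical placeholders, not values taken from the file; check the labels actually used in Meta_Data_individuals.csv before relying on them.

```python
import csv
import io

# Hypothetical toy rows that reuse a few of the column names described above.
# They are NOT real entries from Meta_Data_individuals.csv.
TOY_CSV = """sample_name,time,Cov,estimate_all_admixfrog_NEA
toy_ancient_1,ancient,1.5,2.4
toy_present_1,present-day,30.0,2.0
toy_ancient_2,ancient,0.2,2.9
"""


def select_individuals(handle, time_label="ancient", min_cov=0.5):
    """Return rows whose `time` equals time_label and whose `Cov` is >= min_cov.

    Both the label and the threshold are assumptions made for illustration.
    """
    selected = []
    for row in csv.DictReader(handle):
        if row["time"] == time_label and float(row["Cov"]) >= min_cov:
            selected.append(row)
    return selected


rows = select_individuals(io.StringIO(TOY_CSV))
names = [row["sample_name"] for row in rows]
print(names)
```

With the real file, the same function could be called on a handle opened with `open("Meta_Data_individuals.csv", newline="")`.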
- The three files starting with ‘non_callable_’ are mask files that give regions where we were not able to call ancestries because the ascertainment did not sample there. The files are:
  - non_callable_regions_in_physical_positions.csv gives all regions where there is no sampled position in the ascertainment in a 20 kb window along the genome. With columns:
    - 1: chromosome
    - 2: start position in base pairs (bp)
    - 3: end position in base pairs (bp)
  - non_callable_bins_Shared_map.csv and non_callable_bins_AA_map.csv give the admixfrog-defined bins on the Shared or African-American map, respectively, that fall into these regions. With columns:
    - 1: chromosome the bin is on
    - 2: start position in centiMorgan (cM)
    - 3: end position in centiMorgan (cM)
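The mask files can be used to test whether a position lies in a non-callable region. The sketch below is one possible way to do this in Python, assuming the intervals on a chromosome do not overlap and that both interval ends are inclusive (neither is stated in this description). The intervals shown are invented for illustration; the function names are hypothetical.

```python
import bisect
from collections import defaultdict


def build_mask(rows):
    """Group (chromosome, start, end) rows into sorted per-chromosome intervals."""
    mask = defaultdict(list)
    for chrom, start, end in rows:
        mask[str(chrom)].append((float(start), float(end)))
    for intervals in mask.values():
        intervals.sort()
    return mask


def is_masked(mask, chrom, position):
    """Return True if position lies inside a masked interval on chrom.

    Assumes non-overlapping intervals with inclusive ends.
    """
    intervals = mask.get(str(chrom), [])
    # Index of the last interval whose start is <= position.
    i = bisect.bisect_right(intervals, (position, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= position <= intervals[i][1]


# Invented example intervals (chromosome, start, end), not taken from the files.
toy_mask = build_mask([("1", 100000, 120000), ("1", 500000, 540000), ("2", 0, 20000)])
print(is_masked(toy_mask, "1", 110000), is_masked(toy_mask, "1", 130000))
```

The same lookup works for the bin files, where start and end are given in cM instead of bp.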
The zip files are:
- ALL_NEA_bins_PP_Shared_map.csv.zip contains the admixfrog output for all individuals (columns) throughout the genome, binned in bins of 0.005 cM (rows), using the Shared genetic map. ALL_NEA_bins_PP_AA_map.csv.zip contains the same output for the African-American map. With columns:
  - chrom: chromosome the bin is on
  - map: start position of the bin in genetic position in cM
  - pos: start position in base pairs
  - id: ID of the bin given by admixfrog
  - col 5 - end: for each individual, the posterior probability of Neandertal ancestry for bins that lie on a called Neandertal segment at least 0.2 cM long (ancient individuals) or 0.05 cM long (present-day individuals). For all other bins, the posterior probability is set to zero.
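The zeroing rule for the per-individual columns can be illustrated with a short Python sketch. It is not the code used to produce the files; it only restates the rule above, assuming that for every bin we know the length in cM of the called Neandertal segment it lies on (`None` if it lies on no segment). The function name and the toy values are hypothetical.

```python
# Minimum called-segment length in cM, as stated in the column description above.
MIN_SEGMENT_CM = {"ancient": 0.2, "present-day": 0.05}


def filter_posteriors(posteriors, segment_len_cm, time_label):
    """Keep a bin's posterior only if its segment is long enough, else return 0.0.

    posteriors:      per-bin posterior probabilities of Neandertal ancestry
    segment_len_cm:  per-bin length of the called segment the bin lies on, or None
    time_label:      "ancient" or "present-day"
    """
    threshold = MIN_SEGMENT_CM[time_label]
    return [
        p if (length is not None and length >= threshold) else 0.0
        for p, length in zip(posteriors, segment_len_cm)
    ]


# Toy values for three bins, invented for illustration.
toy_posteriors = [0.9, 0.8, 0.3]
toy_lengths = [0.25, 0.1, None]
as_ancient = filter_posteriors(toy_posteriors, toy_lengths, "ancient")
as_present = filter_posteriors(toy_posteriors, toy_lengths, "present-day")
print(as_ancient, as_present)
```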
- ALL_called_ancestry_segments.csv.zip is one csv file that contains all called segments from admixfrog for each ancestry (Neandertal, Denisovan, African) and for all types: homozygous (both homologous chromosomes have the same ancestry), heterozygous (the two homologous chromosomes have different ancestries), or “state”, which does not differentiate between homozygous and heterozygous ancestry (this is the type mainly used in the study). With columns:
  - chrom: chromosome
  - start: ID of the bin that is at the start of the segment
  - end: ID of the bin that is at the end of the segment
  - score: numerical score giving the certainty of the segment
  - target: ancestry of the segment
  - type: whether the ancestry is homozygous, heterozygous, or a state that disregards the two homologous chromosomes
  - map: start of the segment in cM
  - pos: start of the segment in bp
  - id: same as start
  - map_end: end of the segment in cM
  - pos_end: end of the segment in bp
  - id_end: same as end
  - len: number of bins that the segment is composed of
  - map_len: length of the segment in cM
  - pos_len: length of the segment in bp
  - nscore: numerical score giving the certainty of the segment, normalized by bin size
  - n_all_snps: number of ascertained sites on the segment
  - all_n_AFR: number of ascertained sites on the segment matching the African reference
  - all_n_NEA: number of ascertained sites on the segment matching the Neandertal reference
  - all_n_DEN: number of ascertained sites on the segment matching the Denisovan reference
  - frag_ID: name of the segment
  - sample: ID of the individual the segment is found in
  - genetic_map: which genetic map was used to run admixfrog (either the Shared or the African-American map)
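A typical first step with the segment table is to keep only Neandertal “state” segments above a length cutoff and sum their genetic length per individual. The sketch below shows one way to do that with the Python standard library. The labels `NEA` and `state` and the toy rows are assumptions made for illustration; the actual values of the `target` and `type` columns should be checked in the file.

```python
import csv
import io
from collections import defaultdict

# Invented toy rows using a subset of the columns described above.
TOY_SEGMENTS = """chrom,target,type,map_len,sample
1,NEA,state,0.31,toy_ancient
1,NEA,state,0.08,toy_ancient
2,DEN,state,0.40,toy_ancient
3,NEA,homo,0.50,toy_ancient
4,NEA,state,0.22,toy_other
"""


def neandertal_length_per_sample(handle, min_map_len_cm,
                                 target_label="NEA", type_label="state"):
    """Sum map_len (cM) per sample over segments passing all three filters.

    target_label and type_label are assumed spellings, not confirmed ones.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        if (row["target"] == target_label
                and row["type"] == type_label
                and float(row["map_len"]) >= min_map_len_cm):
            totals[row["sample"]] += float(row["map_len"])
    return dict(totals)


totals = neandertal_length_per_sample(io.StringIO(TOY_SEGMENTS), min_map_len_cm=0.2)
print(totals)
```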
- Neandertal_segments_matching_references_Shared_map.csv.zip is one csv file that contains all called Neandertal segments longer than 0.2 cM (ancient individuals) or 0.05 cM (present-day individuals) and their matching to the reference Neandertal, Denisovan, and Mbuti individuals. This file does not include the 1000 Genomes individuals. With columns:
  - col 1 - 21: same as in ALL_called_ancestry_segments.csv.zip
  - all_reads: the number of sequencing reads (for ancient individuals) or the number of alleles (for present-day individuals) overlapping the segment.
  - prop_matching_Mbuti: the proportion of shared derived reads/alleles matching the Mbuti reference individual, relative to all_reads.
  - prop_matching_Vindija33.19: the proportion of shared derived reads/alleles matching the Vindija Neandertal reference individual, relative to all_reads.
  - prop_matching_Altai: the proportion of shared derived reads/alleles matching the Altai Neandertal reference individual, relative to all_reads.
  - prop_matching_Chagyrskaya: the proportion of shared derived reads/alleles matching the Chagyrskaya Neandertal reference individual, relative to all_reads.
  - prop_matching_Denisova: the proportion of shared derived reads/alleles matching the Denisovan reference individual, relative to all_reads.
  - prop_pHcount_matching_Mbuti: the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Mbuti reference individual.
  - count_matching_Mbuti: the number of shared derived reads/alleles matching the Mbuti reference individual.
  - count_Mbuti: number of derived sites in the Mbuti reference genome overlapping the segment.
  - prop_pHcount_matching_Vindija33.19: the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Vindija Neandertal reference individual.
  - count_matching_Vindija33.19: the number of shared derived reads/alleles matching the Vindija Neandertal reference individual.
  - count_Vindija33.19: number of derived sites in the Vindija reference genome overlapping the segment.
  - prop_pHcount_matching_Altai: the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Altai Neandertal reference individual.
  - count_matching_Altai: the number of shared derived reads/alleles matching the Altai Neandertal reference individual.
  - count_Altai: number of derived sites in the Altai reference genome overlapping the segment.
  - prop_pHcount_matching_Chagyrskaya: the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Chagyrskaya Neandertal reference individual.
  - count_matching_Chagyrskaya: the number of shared derived reads/alleles matching the Chagyrskaya Neandertal reference individual.
  - count_Chagyrskaya: number of derived sites in the Chagyrskaya reference genome overlapping the segment.
  - prop_pHcount_matching_Denisova: the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Denisovan reference individual.
  - count_matching_Denisova: the number of shared derived reads/alleles matching the Denisovan reference individual.
  - count_Denisova: number of derived sites in the Denisovan reference genome overlapping the segment.
  - BT_prop_pHcount_matching_Mbuti_mean: the mean proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Mbuti reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Mbuti_min: the minimum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Mbuti reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Mbuti_max: the maximum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Mbuti reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Mbuti_sd: the standard deviation of the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Mbuti reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Vindija33.19_mean: the mean proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Vindija Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Vindija33.19_min: the minimum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Vindija Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Vindija33.19_max: the maximum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Vindija Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Vindija33.19_sd: the standard deviation of the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Vindija Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Altai_mean: the mean proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Altai Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Altai_min: the minimum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Altai Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Altai_max: the maximum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Altai Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Altai_sd: the standard deviation of the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Altai Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Chagyrskaya_mean: the mean proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Chagyrskaya Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Chagyrskaya_min: the minimum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Chagyrskaya Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Chagyrskaya_max: the maximum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Chagyrskaya Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Chagyrskaya_sd: the standard deviation of the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Chagyrskaya Neandertal reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Denisova_mean: the mean proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Denisovan reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Denisova_min: the minimum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Denisovan reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Denisova_max: the maximum proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Denisovan reference individual, across 100 replicates.
  - BT_prop_pHcount_matching_Denisova_sd: the standard deviation of the proportion of shared derived reads/alleles that are randomly sampled per site (pseudo-haploidized) matching the Denisovan reference individual, across 100 replicates.
  - sample: ID of the individual the segment is found in.
  - Sites_used: whether the estimates are based on all available sites on the segment or only on sites overlapping the archaic admixture sites.
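The replicated pseudo-haploid columns (the BT_ prefix) summarize repeated random sampling of one read or allele per site. The Python sketch below illustrates that idea under stated assumptions; it is not the authors' implementation. Each site is represented as a list of 0/1 flags (1 = the read or allele matches the reference individual's derived allele), the use of the sample standard deviation is an assumption, and all names and toy inputs are hypothetical.

```python
import random
import statistics


def pseudo_haploid_proportion(site_reads, rng):
    """Draw one read per site at random and return the fraction that match."""
    sampled = [rng.choice(reads) for reads in site_reads if reads]
    if not sampled:
        raise ValueError("no sites with reads")
    return sum(sampled) / len(sampled)


def replicate_summary(site_reads, n_replicates=100, seed=0):
    """Repeat the pseudo-haploid draw and summarize the resulting proportions."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    props = [pseudo_haploid_proportion(site_reads, rng) for _ in range(n_replicates)]
    return {
        "mean": statistics.mean(props),
        "min": min(props),
        "max": max(props),
        "sd": statistics.stdev(props),  # sample SD; an assumption
    }


# Invented toy segments: four sites each, 1 = match, 0 = mismatch.
all_matching = replicate_summary([[1, 1], [1], [1, 1, 1], [1]])
mixed = replicate_summary([[1, 0], [0, 0], [1, 1], [0, 1, 1]])
print(all_matching, mixed)
```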
- Ancestry_Covariance_Data_Shared_map.zip contains a folder with one subfolder for each ancient individual older than 20,000 years. Each subfolder contains the ancestry covariance curves computed on all chromosomes and on all chromosomes but one. The left-out chromosome is indicated at the end of the file name. These files were used for the dating in the study. The genetic distances are on the Shared map. All files have the following columns:
  - Bin(cM): distance between SNPs in centiMorgan for which the covariance is calculated
  - Covariance: the estimated covariance between the SNPs
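For readers who want to explore the covariance curves, the sketch below fits a simple exponential decay to a curve by least squares on the log scale. This is a generic illustration under a single-pulse assumption, cov(d) = A·exp(−t·d/100) with d in cM and t in generations; the dating procedure actually used in the study is described in its supplement and may differ. The function names and the synthetic curve are hypothetical.

```python
import math


def fit_exponential_decay(distances_cm, covariances):
    """Fit ln(cov) = ln(A) - rate * d by ordinary least squares.

    Returns (A, rate) with rate expressed per cM. Non-positive covariances are
    skipped because their logarithm is undefined.
    """
    points = [(d, math.log(c)) for d, c in zip(distances_cm, covariances) if c > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two positive covariance values")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def generations_from_rate(rate_per_cm):
    """Convert a decay rate per cM into generations (1 Morgan = 100 cM)."""
    return 100.0 * rate_per_cm


# Synthetic, noise-free curve generated with A = 0.01 and t = 150 generations.
toy_distances = [0.05 * i for i in range(1, 21)]
toy_covariances = [0.01 * math.exp(-150 * d / 100) for d in toy_distances]
amplitude, rate = fit_exponential_decay(toy_distances, toy_covariances)
print(amplitude, generations_from_rate(rate))
```

A log-scale fit is used here only because it needs no external libraries; it weights noisy small covariances heavily, so a nonlinear fit may be preferable on real curves.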
- SNP_files_Shared_map.zip is a folder that contains one xz-compressed file for each ancient individual older than 20,000 years. Each file gives the contamination-corrected genotype likelihoods on the archaic admixture ascertainment, calculated by admixfrog using the Shared map. These files were used together with the admixfrog reference file to calculate the ancestry covariance curves. The columns in the table are:
  - snp_id: ID of the SNP
  - tref: number of reference reads at the SNP
  - talt: number of alternative reads at the SNP
  - chrom: chromosome the SNP is on
  - pos: position of the SNP in base pairs
  - map: position of the SNP in centiMorgan
  - G0: log10-likelihood of the SNP being homozygous reference
  - G1: log10-likelihood of the SNP being heterozygous
  - G2: log10-likelihood of the SNP being homozygous alternative
  - p: estimated allele frequency of the derived allele
  - random_read: status of a random read sampled given p (0 reference, 1 alternative)
  - bin: ID of the bin this SNP is in
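Because G0, G1, and G2 are log10-likelihoods, they usually need to be normalized before use. The Python sketch below converts them into genotype probabilities and an expected count of alternative alleles. It assumes a flat prior over the three genotypes, which is a simplification chosen for illustration and not necessarily what the study did; the function names are hypothetical.

```python
def genotype_probabilities(g0, g1, g2):
    """Turn log10-likelihoods into probabilities that sum to one (flat prior).

    The maximum is subtracted first so that very negative values do not
    underflow to zero for all three genotypes.
    """
    logs = (g0, g1, g2)
    largest = max(logs)
    weights = [10 ** (value - largest) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def expected_alt_dosage(g0, g1, g2):
    """Expected number of alternative alleles (between 0 and 2)."""
    _, p_het, p_hom_alt = genotype_probabilities(g0, g1, g2)
    return p_het + 2 * p_hom_alt


# Invented log10-likelihood values for three SNPs.
uninformative = genotype_probabilities(0.0, 0.0, 0.0)
clear_hom_alt = expected_alt_dosage(-20.0, -20.0, 0.0)
very_negative = genotype_probabilities(-400.0, -401.0, -402.0)
print(uninformative, clear_hom_alt, very_negative)
```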
- ref_archaicadmixtureX_hs37mMask35to99.csv.xz contains the admixfrog reference file on the archaic admixture + X array for the Altai, Vindija, and Chagyrskaya Neandertals, the Denisovan, and all female individuals from the 1000 Genomes project including Mende, Mandenka, Yoruba, Esan, and Luhya individuals (abbreviated AFR). This file is xz compressed. Columns are as follows:
  - chrom: chromosome the SNP is on
  - pos: position of the SNP in base pairs
  - ref: the base of the reference allele at the position
  - alt: the base of the alternative allele at the position
  - map: position of the SNP in centiMorgan
  - AA_Map: genetic coordinates from the African-American map
  - deCODE: genetic coordinates from the deCODE map
  - YRI_LD: genetic coordinates from the Yoruba LD-based map
  - CEU_LD: genetic coordinates from the European (CEU) LD-based map
  - Shared_Map: genetic coordinates from the Shared map
  - AFR_ref: the number of reference alleles observed in the African individuals
  - AFR_alt: the number of alternative alleles observed in the African individuals
  - ALT_ref: the number of reference alleles observed in the Altai Neandertal individual
  - ALT_alt: the number of alternative alleles observed in the Altai Neandertal individual
  - CHA_ref: the number of reference alleles observed in the Chagyrskaya Neandertal individual
  - CHA_alt: the number of alternative alleles observed in the Chagyrskaya Neandertal individual
  - DEN_ref: the number of reference alleles observed in the Denisovan individual
  - DEN_alt: the number of alternative alleles observed in the Denisovan individual
  - NEA_ref: the number of reference alleles observed in the three Neandertal individuals
  - NEA_alt: the number of alternative alleles observed in the three Neandertal individuals
  - PAN_ref: the number of reference alleles observed in the haploid Chimp reference genome (panTro6)
  - PAN_alt: the number of alternative alleles observed in the haploid Chimp reference genome (panTro6)
  - VIN_ref: the number of reference alleles observed in the Vindija Neandertal individual
  - VIN_alt: the number of alternative alleles observed in the Vindija Neandertal individual
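The paired `_ref` and `_alt` count columns can be turned into allele frequencies per reference group. The Python sketch below computes the alternative-allele frequency and flags sites where the African and Neandertal frequencies differ strongly. The 0.9 cutoff and the toy rows are arbitrary illustrations, not the ascertainment scheme used in the study, and the function names are hypothetical.

```python
def alt_frequency(ref_count, alt_count):
    """Alternative-allele frequency, or None when no alleles were observed."""
    total = ref_count + alt_count
    return alt_count / total if total else None


def differs_between_afr_and_nea(row, min_difference=0.9):
    """True if AFR and NEA alternative-allele frequencies differ by >= min_difference.

    row is a mapping with the AFR_ref, AFR_alt, NEA_ref, and NEA_alt columns;
    values may be strings, as returned by csv.DictReader.
    """
    afr = alt_frequency(int(row["AFR_ref"]), int(row["AFR_alt"]))
    nea = alt_frequency(int(row["NEA_ref"]), int(row["NEA_alt"]))
    if afr is None or nea is None:
        return False
    return abs(afr - nea) >= min_difference


# Invented example rows, not taken from the reference file.
toy_fixed_difference = {"AFR_ref": "200", "AFR_alt": "0", "NEA_ref": "0", "NEA_alt": "6"}
toy_shared_variant = {"AFR_ref": "120", "AFR_alt": "80", "NEA_ref": "2", "NEA_alt": "4"}
toy_missing_nea = {"AFR_ref": "200", "AFR_alt": "0", "NEA_ref": "0", "NEA_alt": "0"}
print(differs_between_afr_and_nea(toy_fixed_difference),
      differs_between_afr_and_nea(toy_shared_variant))
```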
We identified Neandertal introgressed segments in 59 ancient and 2118 present-day modern human individuals using an HMM approach called admixfrog.