Signals interpreted as archaic introgression are driven primarily by accelerated evolution in Africa
Amos, William (2020), Signals interpreted as archaic introgression are driven primarily by accelerated evolution in Africa, Dryad, Dataset, https://doi.org/10.5061/dryad.2fqz612kn
Non-African humans appear to carry a few percent archaic DNA due to ancient inter-breeding. This modest legacy and its likely recent timing imply that most introgressed fragments will be rare and hence will occur mainly in the heterozygous state. I tested this prediction by calculating D statistics, a measure of legacy size, for pairs of humans where one of the pair was conditioned always to be either homozygous or heterozygous. Using coalescent simulations, I confirmed that conditioning the non-African to be heterozygous increased D while conditioning the non-African to be homozygous reduced D to zero. Repeating with real data reveals the exact opposite pattern. In African – non-African comparisons, D is near-zero if the African individual is held homozygous. Conditioning one of two Africans to be either homozygous or heterozygous invariably generates large values of D, even when both individuals are drawn from the same population. Invariably, the African with more heterozygous sites (conditioned heterozygous > unconditioned > conditioned homozygous) appears less related to the archaic. In contrast, the same analysis applied to pairs of non-Africans always yields near-zero D, showing that conditioning does not create large D without an underlying signal to expose. Large D values in humans are therefore driven almost entirely by heterozygous sites in Africans acting to increase divergence from related taxa such as Neanderthals. In comparison with heterozygous Africans, individuals that lack African heterozygous sites, whether non-African or conditioned homozygous African, always appear more similar to archaic outgroups, a signal previously interpreted as evidence for introgression. I hope these analyses will encourage others to consider increased divergence as well as increased similarity to archaics as mechanisms capable of driving asymmetrical base-sharing.
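As a rough illustration of the D statistic referred to above, the following is a minimal sketch of the standard ABBA-BABA form, D = (nABBA - nBABA) / (nABBA + nBABA). It is not the author's code. The `Site` layout, the function names and the sign convention (positive D when the second human shares more bases with the archaic) are assumptions made for the example, and each individual is reduced to a single sampled base per site.

```cpp
#include <cassert>
#include <vector>

// One aligned site: a single base sampled from each of two humans (p1, p2),
// the archaic genome and the chimpanzee outgroup. Hypothetical layout.
struct Site {
    char p1;
    char p2;
    char archaic;
    char outgroup;
};

struct PatternCounts {
    long abba;  // p1 matches the outgroup, p2 matches the archaic
    long baba;  // p1 matches the archaic, p2 matches the outgroup
};

// Count the two informative site patterns. Sites where the archaic and the
// outgroup carry the same base cannot distinguish the two humans and are skipped.
PatternCounts countPatterns(const std::vector<Site>& sites) {
    PatternCounts counts = {0, 0};
    for (const Site& s : sites) {
        if (s.archaic == s.outgroup) continue;
        if (s.p1 == s.outgroup && s.p2 == s.archaic) {
            ++counts.abba;
        } else if (s.p1 == s.archaic && s.p2 == s.outgroup) {
            ++counts.baba;
        }
    }
    return counts;
}

// D = (ABBA - BABA) / (ABBA + BABA); returns 0 when no informative sites exist.
double dStatistic(const PatternCounts& counts) {
    assert(counts.abba >= 0 && counts.baba >= 0);
    const long total = counts.abba + counts.baba;
    if (total == 0) return 0.0;
    return static_cast<double>(counts.abba - counts.baba) / static_cast<double>(total);
}
```

Under this sketch, conditioning one individual to be homozygous or heterozygous would amount to filtering the sites by that individual's genotype before calling `countPatterns`. That filtering step is not shown here.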
These data are parsed versions of publicly available data. For the Altai Neanderthal, I applied simple C++ scripts to the data available at http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/ to extract the minimal information required for my analyses: base location, human reference base and the four base counts (A, C, G, T, in that order). For the chimpanzee-human alignment, I downloaded raw alignments from http://www.ensembl.org/info/data/ftp/ and then extracted the bases and aligned them to the human reference sequence hs37d5 used by the 1000 Genomes Project. I excluded sites within 30 of either end of a contig. Since most contigs are many tens of kb long, the loss of information is minimal, while possible alignment edge effects should be avoided.
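The record layout and the contig-edge filter described above might be handled along the following lines. This is a sketch under stated assumptions, not the author's script: it assumes one whitespace-separated record per line in the order location, reference base, A, C, G, T counts, and it assumes the margin of 30 is measured in base positions with a strict comparison. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One parsed archaic record: location, human reference base and the four
// base counts in the order A, C, G, T. Field names are hypothetical.
struct ArchaicRecord {
    long position;
    char refBase;
    int countA;
    int countC;
    int countG;
    int countT;
};

// Parse one line, assuming whitespace-separated fields (an assumption about
// the file layout). Returns false and leaves `out` untouched on malformed input.
bool parseArchaicRecord(const std::string& line, ArchaicRecord& out) {
    std::istringstream in(line);
    ArchaicRecord r = {};
    if (!(in >> r.position >> r.refBase
             >> r.countA >> r.countC >> r.countG >> r.countT)) {
        return false;
    }
    out = r;
    return true;
}

// True when a site lies closer than `margin` positions to either end of its
// contig. Whether a site exactly `margin` away is excluded is an assumption.
bool nearContigEdge(long position, long contigStart, long contigEnd, long margin = 30) {
    assert(contigStart <= contigEnd);
    return (position - contigStart) < margin || (contigEnd - position) < margin;
}
```

A caller would skip any line for which `parseArchaicRecord` returns false and any alignment site for which `nearContigEdge` returns true.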
An annotated example of my C++ code is included as a supplementary file to the paper. It can be pasted directly into, for example, Visual Studio. The user will also need to download the corresponding chromosome VCF files from the 1000 Genomes Project site, decompress them, rename them 'ALL.chr[chromosome number].vcf', and edit the paths in my code so that they are appropriate for the computer on which the code is run.
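One way to keep the path edits in a single place is to build each file name from a data directory and a chromosome number, following the 'ALL.chr[chromosome number].vcf' naming above. The helper below is a hypothetical sketch and is not taken from the supplementary code. The function name and the `dataDir` parameter are assumptions.

```cpp
#include <cassert>
#include <string>

// Build the path to a renamed 1000 Genomes VCF file for one chromosome,
// e.g. dataDir "data" and chromosome 21 give "data/ALL.chr21.vcf".
// `dataDir` is a hypothetical setting to be edited for the local computer.
std::string vcfPath(const std::string& dataDir, int chromosome) {
    assert(chromosome > 0);
    const std::string fileName = "ALL.chr" + std::to_string(chromosome) + ".vcf";
    if (dataDir.empty()) {
        return fileName;  // look in the current working directory
    }
    const char last = dataDir[dataDir.size() - 1];
    if (last == '/' || last == '\\') {
        return dataDir + fileName;  // directory already ends with a separator
    }
    return dataDir + "/" + fileName;
}
```

With this in place, only the value passed as `dataDir` would need to change when the code is moved to another machine.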