The MUC19 Gene: An Evolutionary History of Recurrent Introgression and Natural Selection
Data files
Aug 21, 2025 version files 145.65 GB
-
72kb_amr_beds.tar.gz
26.06 KB
-
afr_v_den_divergence_windows.tar.gz
185.73 KB
-
alt_heterozygosity_chromosome.tar.gz
1.39 KB
-
alt_heterozygosity_windows.tar.gz
43.93 KB
-
alt_masked_no_aa_window_info_and_eff_seq_len.tar.gz
474.72 KB
-
alt_masked_no_aa.tar.gz
5.06 MB
-
amr_asn_eur_pbs_windows.tar.gz
76.92 MB
-
annotations.tar.gz
8.78 MB
-
archaic_psuedo_ancestry_painting_windows.tar.gz
78.06 KB
-
archaic_site_patterns_windows.tar.gz
548.23 KB
-
arcs_masked_aa_window_info_and_eff_seq_len.tar.gz
1.39 MB
-
arcs_masked_aa.tar.gz
15.34 MB
-
arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz
1.40 MB
-
arcs_masked_no_aa.tar.gz
14.04 MB
-
bookkeeping.tar.gz
53.78 GB
-
cha_heterozygosity_chromosome.tar.gz
1.39 KB
-
cha_heterozygosity_windows.tar.gz
40.27 KB
-
cha_masked_no_aa_window_info_and_eff_seq_len.tar.gz
471.86 KB
-
cha_masked_no_aa.tar.gz
4.99 MB
-
cha_phased_ref_panel_all_inds.tar.gz
3.74 KB
-
classify_tgp_snps_chromosome.tar.gz
2.92 GB
-
den_heterozygosity_chromosome.tar.gz
1.40 KB
-
den_heterozygosity_windows.tar.gz
52.04 KB
-
den_masked_no_aa_window_info_and_eff_seq_len.tar.gz
479.31 KB
-
den_masked_no_aa.tar.gz
5.31 MB
-
den_v_alt_divergence_windows.tar.gz
66.71 KB
-
mxl_chb_ceu_pbs_chromsome.tar.gz
145.48 MB
-
mxl_chb_ceu_pbs_windows.tar.gz
1.02 MB
-
mxl_slimulations.tar.gz
295.17 MB
-
negative_recomb_map_slimulated_vcfs.tar.gz
2.77 GB
-
neutral_recomb_map_slimulated_vcfs.tar.gz
2.84 GB
-
positive_neutral_s0015_recomb_map_slimulated_vcfs.tar.gz
259.70 MB
-
positive_neutral_s01_recomb_map_slimulated_vcfs.tar.gz
259.48 MB
-
positive_neutral_s1_recomb_map_slimulated_vcfs.tar.gz
251.17 MB
-
README.md
74.84 KB
-
sgdp_alt_masked_no_aa_eff_seq_len.tar.gz
254 B
-
sgdp_alt_masked_no_aa.tar.gz
653.20 MB
-
sgdp_arcs_masked_no_aa.tar.gz
728.90 MB
-
sgdp_cha_masked_no_aa_eff_seq_len.tar.gz
260 B
-
sgdp_cha_masked_no_aa.tar.gz
644.74 MB
-
sgdp_den_masked_no_aa_eff_seq_len.tar.gz
258 B
-
sgdp_den_masked_no_aa.tar.gz
650.30 MB
-
sgdp_denisovan_intro_tracts_in_papuans_qc_info.tar.gz
90.72 KB
-
sgdp_denisovan_sequence_divergence_at_denisovan_intro_tracts_in_papuans.tar.gz
94.95 KB
-
sgdp_vin_masked_no_aa_eff_seq_len.tar.gz
258 B
-
sgdp_vin_masked_no_aa.tar.gz
651.16 MB
-
short_read_repeat_amr_beds.tar.gz
25.44 KB
-
sprime_sites_mxl_chb_ceu_pbs_windows.tar.gz
88.75 KB
-
tgp_alt_masked_aa_window_info_and_eff_seq_len.tar.gz
566.28 KB
-
tgp_alt_masked_aa.tar.gz
4.66 GB
-
tgp_alt_masked_no_aa_window_info_and_eff_seq_len.tar.gz
568.90 KB
-
tgp_alt_masked_no_aa.tar.gz
4.73 GB
-
tgp_archaic_psuedo_ancestry_painting_windows.tar.gz
130.89 KB
-
tgp_archaic_snp_denisty_windows.tar.gz
1.21 MB
-
tgp_arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz
1.70 MB
-
tgp_arcs_masked_no_aa.tar.gz
5.04 GB
-
tgp_cha_masked_aa_window_info_and_eff_seq_len.tar.gz
567.04 KB
-
tgp_cha_masked_aa.tar.gz
4.59 GB
-
tgp_cha_masked_no_aa_window_info_and_eff_seq_len.tar.gz
569.77 KB
-
tgp_cha_masked_no_aa.tar.gz
4.67 GB
-
tgp_den_masked_aa_window_info_and_eff_seq_len.tar.gz
566.06 KB
-
tgp_den_masked_aa.tar.gz
4.63 GB
-
tgp_den_masked_no_aa_window_info_and_eff_seq_len.tar.gz
568.93 KB
-
tgp_den_masked_no_aa.tar.gz
4.70 GB
-
tgp_hap_v_arc_dip_divergence_windows.tar.gz
819.25 MB
-
tgp_hap_v_arc_psuedo_hap_divergence_windows.tar.gz
352.94 MB
-
tgp_heterozygosity_chromosome.tar.gz
148.63 KB
-
tgp_heterozygosity_windows.tar.gz
32.95 MB
-
tgp_ihs_tables.tar.gz
3.33 GB
-
tgp_mod_aa.tar.gz
7.50 GB
-
tgp_mod_no_aa_window_info_and_eff_seq_len.tar.gz
499.72 KB
-
tgp_mod_no_aa.tar.gz
7.89 GB
-
tgp_q95_u30_afr_b_den_windows.tar.gz
1.14 MB
-
tgp_selscan_ihs_output.tar.gz
5.47 GB
-
tgp_selscan_maps.tar.gz
4.12 GB
-
tgp_selscan_vcfs.tar.gz
6.74 GB
-
tgp_u30_afr_b_den_genes.tar.gz
5.97 KB
-
tgp_u30_afr_b_den_windows.tar.gz
19.40 KB
-
tgp_vin_masked_aa_window_info_and_eff_seq_len.tar.gz
566.49 KB
-
tgp_vin_masked_aa.tar.gz
4.64 GB
-
tgp_vin_masked_no_aa_window_info_and_eff_seq_len.tar.gz
569.11 KB
-
tgp_vin_masked_no_aa.tar.gz
4.72 GB
-
vin_heterozygosity_chromosome.tar.gz
1.40 KB
-
vin_heterozygosity_windows.tar.gz
47.64 KB
-
vin_masked_no_aa_window_info_and_eff_seq_len.tar.gz
477.65 KB
-
vin_masked_no_aa.tar.gz
5.16 MB
-
vin_phased_ref_panel_all_inds.tar.gz
3.75 KB
-
yri_na19664_archaic_site_patterns_windows.tar.gz
3.77 MB
Abstract
We study the gene MUC19, for which some modern humans carry a Denisovan-like haplotype. MUC19 is a mucin, a glycoprotein that forms gels with various biological functions. We find diagnostic variants for the Denisovan-like MUC19 haplotype at high frequencies in admixed Latin American individuals, and at highest frequency in 23 ancient Indigenous American individuals, all predating population admixture with Europeans and Africans. We find that the Denisovan-like MUC19 haplotype is under positive selection and carries a higher copy number of a 30 base-pair variable number tandem repeat, and that copy numbers of this repeat are exceedingly high in American populations. Finally, we find that some Neanderthals carry the Denisovan-like MUC19 haplotype, and that it likely entered human populations through Neanderthal rather than Denisovan introgression.
Citation: Please cite Villanea and Peede et al. 2025 (https://doi.org/10.1126/science.adl0882) when using this data.
To use the data, you will first need to extract the individual data archives (.tar.gz). For example, run tar -xf {data_archive}.tar.gz. After extraction, you may need to edit the code on GitHub so that the paths are appropriate for the computer on which the code is run. The paths provided in this README reflect my original project directory structure.
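For readers who prefer to script the extraction step, the sketch below unpacks every archive in a download folder using Python's standard library. The function name and both directory arguments are hypothetical; adjust them to wherever the archives were downloaded.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, dest_dir):
    """Extract every .tar.gz archive found in archive_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Use the safer "data" extraction filter where this Python supports it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(dest, filter="data")
            else:
                tar.extractall(dest)
        extracted.append(archive.name)
    return extracted
```

Some archives are several gigabytes, so it may be worth extracting only the ones needed for a given analysis.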
Below is a quick reference of the naming conventions I used throughout the project.
Naming Conventions
- Archaic Individuals
  - den: Altai Denisovan
  - alt: Altai Neanderthal
  - cha: Chagyrskaya Neanderthal
    - This individual is considered a late Neanderthal.
  - vin: Vindija Neanderthal
    - This individual is considered a late Neanderthal.
- Modern Human Datasets
  - tgp: 1000 Genomes Project
  - sgdp: Simons Genome Diversity Project
- 1000 Genomes Project Superpopulations & Populations
  - afr: Africans
    - Populations: yri, lwk, gwd, msl, esn
  - amr: Admixed Americans
    - Populations: mxl, pel, clm, pur
  - sas: South Asians
    - Populations: beb, stu, itu, pjl, gih
  - eas: East Asians
    - Populations: chb, khv, chs, jpt, cdx
  - eur: Europeans
    - Populations: tsi, ceu, ibs, gbr, fin
Data Description
Note on Paths: The file paths listed below correspond to the directory structure used in the original study. After extracting the archives, you may need to organize the files accordingly or adjust paths in the analysis scripts.
amr_lai
72kb_amr_beds.tar.gz & short_read_repeat_amr_beds.tar.gz
Both 72kb_amr_beds.tar.gz and short_read_repeat_amr_beds.tar.gz contain BED files, which are the output from the bedtools code found here.
Paths
- 72kb_amr_beds.tar.gz
  - ./muc19/amr_lai/region_beds/72kb/{amr_ind}_{A,B}.bed
- short_read_repeat_amr_beds.tar.gz
  - ./muc19/amr_lai/region_beds/short_read_repeat/{amr_ind}_{A,B}.bed
{amr_ind}_{A,B}.bed
Each admixed American individual ({amr_ind}) has two corresponding BED files, one per haplotype ({A,B}). Each BED file has the following columns:
- Columns 1-6: Original columns from RFmix.
- Columns 7-9: Coordinates for the intersecting region of interest.
  - The BED files for the intersecting region of interest can be found here.
- Column 10: Number of base pairs of overlap with respect to the intersecting region of interest.
annotations
annotations.tar.gz
annotations.tar.gz contains the output from data_processing_v_revisions.ipynb found here and the outputs from tgp_single_archaic_gene_qc_v_revisions.py and consolidate_tgp_single_archaic_genes_v_revisions.py, which can be found here.
Paths
- annotations.tar.gz
  - ./muc19/annotations/hg19_genes/ncbi_refseq_genes_chr{1..22}.csv.gz
  - ./muc19/annotations/hg19_genes/ncbi_refseq_transcripts.txt
  - ./muc19/annotations/hg19_genes/ncbiRefSeqSelect.txt.gz
  - ./muc19/annotations/rate_files/genetic_map_GRCh37_chr12.txt
  - ./muc19/annotations/rate_files/subst_rate_autosome.bed.gz
  - ./muc19/annotations/rate_files/subst_rate_autosome.bed.gz.tbi
  - ./muc19/annotations/rate_files/{742,72}kb_region.bed
  - ./muc19/annotations/rate_files/subst_rate_{742,72}kb.bed
  - ./muc19/annotations/tgp_den_masked_no_aa/ncbi_refseq_genes_summary_chr{1..22}.csv.gz
  - ./muc19/annotations/tgp_den_masked_no_aa/ncbi_refseq_{invariant,variant}_genes.csv.gz
ncbi_refseq_genes_chr{1..22}.csv.gz
One file per autosome ({1..22}) containing the NCBI RefSeq Select gene coordinates with the following columns:
- GENE_ID: NCBI RefSeq Select gene ID.
- TRANSCRIPT_ID: NCBI RefSeq Select transcript ID.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
ncbi_refseq_transcripts.txt
One file with a single column listing all of the autosomal NCBI RefSeq Select transcript IDs used for the SnpEff annotations.
ncbiRefSeqSelect.txt.gz
Original NCBI RefSeq Select gene coordinates in GTF format, downloaded from the UCSC Genome Browser (download link). The format of the file is described here.
genetic_map_GRCh37_chr12.txt
HapMap genetic map for chromosome 12 build GRCh37/hg19, with the following columns:
- Chromosome: Chromosome.
- Position(bp): Physical position.
- Rate(cM/Mb): Local recombination rate.
- Map(cM): The cumulative genetic map position with respect to the corresponding Position(bp) coordinate.
subst_rate_autosome.bed.gz and subst_rate_autosome.bed.gz.tbi
subst_rate_autosome.bed.gz is a bgzipped BED file containing local mutation rates. This BED file was generated using this script, with the following columns:
- Column 1: Chromosome.
- Column 2: Start position (inclusive).
- Column 3: End position (exclusive).
- Column 4: Local mutation rate.
subst_rate_autosome.bed.gz.tbi is the corresponding binary tabix index file, generated by running the following code:
tabix -p bed subst_rate_autosome.bed.gz
{742,72}kb_region.bed
BED files with a single row per file for the focal 742kb and 72kb ({742,72}kb) MUC19 regions, with the following columns:
- Column 1: Chromosome.
- Column 2: Start position (inclusive).
- Column 3: End position (exclusive).
subst_rate_{742,72}kb.bed
BED files containing the local mutation rates for the focal 742kb and 72kb ({742,72}kb) MUC19 regions, generated by running the following bedtools code:
bedtools intersect -a subst_rate_autosome.bed.gz -b 72kb_region.bed -wo > subst_rate_72kb.bed
bedtools intersect -a subst_rate_autosome.bed.gz -b 742kb_region.bed -wo > subst_rate_742kb.bed
with the following columns:
- Columns 1-4: Original columns from subst_rate_autosome.bed.gz for the regions that intersect with the focal 742kb and 72kb ({742,72}kb) MUC19 regions.
- Columns 5-7: Original columns from {742,72}kb_region.bed for the focal 742kb and 72kb MUC19 regions.
- Column 8: Number of base pairs from subst_rate_autosome.bed.gz (i.e., columns 1-3 for a given row) that overlap with {742,72}kb_region.bed (i.e., columns 5-7 for a given row).
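Because column 8 records how many base pairs of each mutation-rate interval fall inside the focal region, one natural summary is an overlap-weighted mean of column 4. The sketch below illustrates that calculation; it assumes tab-delimited rows in the column order listed above, and the function name is hypothetical. It is not necessarily the calculation used in the original analysis.

```python
import csv


def weighted_mean_rate(bed_lines):
    """Overlap-weighted mean of column 4 (rate), weighted by column 8 (bp overlap)."""
    total_bp = 0
    weighted_sum = 0.0
    for row in csv.reader(bed_lines, delimiter="\t"):
        if not row:
            continue
        rate = float(row[3])   # column 4: local mutation rate
        overlap = int(row[7])  # column 8: overlapping base pairs
        weighted_sum += rate * overlap
        total_bp += overlap
    if total_bp == 0:
        raise ValueError("no overlapping base pairs found")
    return weighted_sum / total_bp
```

The function accepts any iterable of lines, so an open file handle for subst_rate_72kb.bed or subst_rate_742kb.bed can be passed directly.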
ncbi_refseq_genes_summary_chr{1..22}.csv.gz
One file per autosome (chr{1..22}) where each row corresponds to the QC summary per NCBI RefSeq Select gene, with the following columns:
- IDX: Gene index with respect to the CHR column.
- GENE_ID: NCBI RefSeq Select gene ID.
- TRANSCRIPT_ID: NCBI RefSeq Select transcript ID.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- DEN: Effective sequence length with respect to the Altai Denisovan.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
ncbi_refseq_{invariant,variant}_genes.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) NCBI RefSeq Select genes, with the following columns:
- IDX: Gene index with respect to the CHR column.
- GENE_ID: NCBI RefSeq Select gene ID.
- TRANSCRIPT_ID: NCBI RefSeq Select transcript ID.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- DEN: Effective sequence length with respect to the Altai Denisovan.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
arc_snp_density
classify_tgp_snps_chromosome.tar.gz
classify_tgp_snps_chromosome.tar.gz is the output from classify_tgp_snps_chromosome_v_revisions.py found here. The original path is ./muc19/muc19_results/tgp_arcs_masked_no_aa/{non_afr_pop,tgp}_{snp_type}_chr{1..22}.txt.gz.
{non_afr_pop,tgp}_{snp_type}_chr{1..22}.txt.gz
For each non-African population ({non_afr_pop}) and the entire 1000 Genomes Project ({tgp}), each file has one row listing the positions for the specified SNP classification ({snp_type}), stratified by autosome (chr{1..22}).
tgp_archaic_snp_denisty_windows.tar.gz
tgp_archaic_snp_denisty_windows.tar.gz is the output from tgp_archaic_snp_denisty_windows_v_revisions.py found here. The original path is ./muc19/muc19_results/tgp_arcs_masked_no_aa/{non_afr_pop}_arc_snp_denisty_chr{1..22}_{742,72}kb.txt.gz.
{non_afr_pop}_arc_snp_denisty_chr{1..22}_{742,72}kb.txt.gz
For each non-African population ({non_afr_pop}), each row contains archaic SNP partition counts for a non-overlapping 742kb or 72kb window ({742,72}kb), stratified by autosome (chr{1..22}), with the following columns:
- Column 1: Denisovan-specific SNPs.
- Column 2: Neanderthal-specific SNPs.
- Column 3: Shared archaic SNPs.
- Column 4: Archaic SNPs.
heterozygosity
{den,alt,cha,vin}_heterozygosity_chromosome.tar.gz
den_heterozygosity_chromosome.tar.gz, alt_heterozygosity_chromosome.tar.gz, cha_heterozygosity_chromosome.tar.gz, and vin_heterozygosity_chromosome.tar.gz are the output from archaic_heterozygosity_chromosome_v_revisions.py found here.
Paths
- den_heterozygosity_chromosome.tar.gz
  - ./muc19/muc19_results/den_masked_no_aa/archaic_het_sites_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
- alt_heterozygosity_chromosome.tar.gz
  - ./muc19/muc19_results/alt_masked_no_aa/archaic_het_sites_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
- cha_heterozygosity_chromosome.tar.gz
  - ./muc19/muc19_results/cha_masked_no_aa/archaic_het_sites_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
- vin_heterozygosity_chromosome.tar.gz
  - ./muc19/muc19_results/vin_masked_no_aa/archaic_het_sites_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
archaic_het_sites_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
For each archaic individual ({den,alt,cha,vin}), there is one corresponding file per autosome (chr{1..22}) with the following columns:
- Column 1: Number of heterozygous sites.
- Column 2: Gene diversity.
- Column 3: Effective sequence length.
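One way to combine the per-chromosome files into a single genome-wide summary is to divide the total number of heterozygous sites by the total effective sequence length. The sketch below shows this under the assumption that each file holds whitespace-delimited numeric rows in the column order listed above; the delimiter and both function names are assumptions, not something this README specifies.

```python
import gzip


def read_numeric_rows(path):
    """Read a gzipped, whitespace-delimited text file as rows of floats."""
    with gzip.open(path, "rt") as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def het_sites_per_bp(rows):
    """Total heterozygous sites (column 1) over total effective length (column 3)."""
    het_sites = sum(row[0] for row in rows)
    eff_len = sum(row[2] for row in rows)
    if eff_len == 0:
        raise ValueError("effective sequence length sums to zero")
    return het_sites / eff_len
```

To pool chromosomes, concatenate the rows returned by read_numeric_rows for chr1 through chr22 before calling het_sites_per_bp.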
{den,alt,cha,vin}_heterozygosity_windows.tar.gz
den_heterozygosity_windows.tar.gz, alt_heterozygosity_windows.tar.gz, cha_heterozygosity_windows.tar.gz, and vin_heterozygosity_windows.tar.gz are the output from archaic_heterozygosity_windows_v_revisions.py found here.
Paths
- den_heterozygosity_windows.tar.gz
  - ./muc19/muc19_results/den_masked_no_aa/archaic_het_sites_heterozygosity_chr{1..22}_72kb.txt.gz
- alt_heterozygosity_windows.tar.gz
  - ./muc19/muc19_results/alt_masked_no_aa/archaic_het_sites_heterozygosity_chr{1..22}_72kb.txt.gz
- cha_heterozygosity_windows.tar.gz
  - ./muc19/muc19_results/cha_masked_no_aa/archaic_het_sites_heterozygosity_chr{1..22}_72kb.txt.gz
- vin_heterozygosity_windows.tar.gz
  - ./muc19/muc19_results/vin_masked_no_aa/archaic_het_sites_heterozygosity_chr{1..22}_72kb.txt.gz
archaic_het_sites_heterozygosity_chr{1..22}_72kb.txt.gz
For each archaic individual ({den,alt,cha,vin}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping 72kb window and has the following columns:
- Column 1: Number of heterozygous sites.
- Column 2: Gene diversity.
tgp_heterozygosity_chromosome.tar.gz
tgp_heterozygosity_chromosome.tar.gz is the output from tgp_heterozygosity_chromosome_v_revisions.py found here.
Paths
- tgp_heterozygosity_chromosome.tar.gz
  - ./muc19/muc19_results/tgp_mod_no_aa/tgp_het_sites_chr{1..22}.txt.gz
  - ./muc19/muc19_results/tgp_mod_no_aa/afr_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
tgp_het_sites_chr{1..22}.txt.gz
For each autosome (chr{1..22}), the columns correspond to a 1000 Genomes Project individual, in the same order as individuals appear in ./muc19/meta_data/tgp_mod.txt. Each entry is the number of heterozygous sites for that individual in that chromosome.
afr_heterozygosity_eff_seq_len_chr{1..22}.txt.gz
For the African superpopulation (afr) in the 1000 Genomes Project, there is one corresponding file per autosome (chr{1..22}) with the following columns:
- Column 1: Gene diversity.
- Column 2: Effective sequence length.
tgp_heterozygosity_windows.tar.gz
tgp_heterozygosity_windows.tar.gz is the output from tgp_heterozygosity_windows_v_revisions.py found here. The original path is ./muc19/muc19_results/tgp_mod_no_aa/{afr,het,hom}_het_sites_chr{1..22}_72kb.txt.gz.
{afr,het,hom}_het_sites_chr{1..22}_72kb.txt.gz
For each focal group ({afr,het,hom}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping 72kb window, and the columns correspond to the number of heterozygous sites per 1000 Genomes Project individual. The order of individuals is as specified in the following metadata files:
- afr: African individuals.
- het: Individuals with one introgressed haplotype at the 72kb region.
- hom: Individuals with two introgressed haplotypes at the 72kb region.
iHS
tgp_selscan_maps.tar.gz & tgp_selscan_vcfs.tar.gz
tgp_selscan_maps.tar.gz and tgp_selscan_vcfs.tar.gz are the outputs from make_selscan_population_data_v_revisions.py found here.
Paths
- tgp_selscan_maps.tar.gz
  - ./muc19/iHS/maps/{pop}_selscan_chr{1..22}.map
- tgp_selscan_vcfs.tar.gz
  - ./muc19/iHS/selscan_vcfs/{pop}_selscan_chr{1..22}.vcf
{pop}_selscan_chr{1..22}.map
For every population in the 1000 Genomes Project ({pop}), there is one corresponding map file per autosome (chr{1..22}) with the following columns:
- Column 1: Chromosome.
- Column 2: Locus ID.
- Column 3: Genetic position.
- Column 4: Physical position.
{pop}_selscan_chr{1..22}.vcf
For every population in the 1000 Genomes Project ({pop}), there is one corresponding selscan formatted VCF file per autosome (chr{1..22}) with the following columns:
- Column 1: Chromosome.
- Column 2: Physical position.
- Column 3: Locus ID.
- Column 4: Ancestral allele.
- Column 5: Derived allele.
- Column 6: Quality field.
- Column 7: Filter field.
- Column 8: Info field.
- Column 9: Format field.
- Subsequent columns contain genotype information for individuals in the same order as they appear in ./muc19/meta_data/tgp_mod.txt.
tgp_selscan_ihs_output.tar.gz
tgp_selscan_ihs_output.tar.gz is the output from the selscan code found here.
Paths
- tgp_selscan_ihs_output.tar.gz
  - ./muc19/muc19_results/tgp_mod_aa/{pop}_chr{1..22}.ihs.out
  - ./muc19/muc19_results/tgp_mod_aa/{pop}_chr{1..22}.ihs.out.100bins.norm
{pop}_chr{1..22}.ihs.out & {pop}_chr{1..22}.ihs.out.100bins.norm
For every population in the 1000 Genomes Project ({pop}), there is a corresponding unnormalized (.ihs.out) and normalized (.ihs.out.100bins.norm) iHS file per autosome (chr{1..22}). The columns are described in the selscan documentation, which you can find here.
tgp_ihs_tables.tar.gz
tgp_ihs_tables.tar.gz is the output from compile_ihs_tables_v_revisions.py found here.
Paths
- tgp_ihs_tables.tar.gz
  - ./muc19/muc19_results/tgp_mod_aa/{pop}_ihs_genome_wide.csv.gz
  - ./muc19/muc19_results/tgp_mod_aa/{pop}_ihs_windows_{742,72}kb.csv.gz
{pop}_ihs_genome_wide.csv.gz
For every population in the 1000 Genomes Project ({pop}), this file contains genome-wide iHS data with the following columns:
- ID: Locus ID.
- CHR: Chromosome.
- POS: Physical position.
- DAF: Derived allele frequency.
- IHH1: Integrated haplotype homozygosity for the derived allele.
- IHH0: Integrated haplotype homozygosity for the ancestral allele.
- U_IHS: Unnormalized integrated haplotype score.
- N_IHS: Normalized integrated haplotype score.
- CRIT: Is the absolute value of the normalized iHS greater than two? (numeric, e.g., 1 if true, 0 if false).
- DEN: Is this a Denisovan-specific SNP? (boolean).
- NEA: Is this a Neanderthal-specific SNP? (boolean).
- SHR: Is this a shared archaic SNP? (boolean).
- ARC: Is this an archaic SNP? (boolean).
- HOM: Is this a shared Hominin SNP? (boolean).
- HUM: Is this a human-specific SNP? (boolean).
- ALL_CRIT: Is the absolute value of the normalized iHS greater than two? (boolean).
- DEN_CRIT: Is this a Denisovan-specific SNP AND is the absolute value of the normalized iHS greater than two? (boolean).
- NEA_CRIT: Is this a Neanderthal-specific SNP AND is the absolute value of the normalized iHS greater than two? (boolean).
- SHR_CRIT: Is this a shared archaic SNP AND is the absolute value of the normalized iHS greater than two? (boolean).
- ARC_CRIT: Is this an archaic SNP AND is the absolute value of the normalized iHS greater than two? (boolean).
- HOM_CRIT: Is this a shared Hominin SNP AND is the absolute value of the normalized iHS greater than two? (boolean).
- HUM_CRIT: Is this a human-specific SNP AND is the absolute value of the normalized iHS greater than two? (boolean).
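As a quick sanity check on a genome-wide table, the fraction of scored SNPs with |normalized iHS| > 2 can be recomputed directly from the N_IHS column. The sketch below assumes a gzipped CSV with a header row containing N_IHS; the function name, the handling of missing values, and the threshold argument are illustrative assumptions.

```python
import csv
import gzip
import math


def critical_fraction(path, threshold=2.0):
    """Fraction of SNPs whose |N_IHS| exceeds threshold, skipping missing scores."""
    n_scored = 0
    n_critical = 0
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                score = float(row["N_IHS"])
            except (TypeError, ValueError):
                continue  # empty or non-numeric entry
            if math.isnan(score):
                continue
            n_scored += 1
            n_critical += abs(score) > threshold
    return n_critical / n_scored if n_scored else float("nan")
```

The result should agree with the mean of the CRIT column if CRIT is coded as 1/0 over the same set of scored SNPs.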
{pop}_ihs_windows_{742,72}kb.csv.gz
For every population in the 1000 Genomes Project ({pop}), this file contains windowed iHS data. Each row corresponds to a non-overlapping window ({742,72}kb) with the following columns:
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (exclusive).
- N_ALL_SNPS: Number of SNPs.
- N_DEN_SNPS: Number of Denisovan-specific SNPs.
- N_NEA_SNPS: Number of Neanderthal-specific SNPs.
- N_SHR_SNPS: Number of shared archaic SNPs.
- N_ARC_SNPS: Number of archaic SNPs.
- N_HOM_SNPS: Number of shared Hominin SNPs.
- N_HUM_SNPS: Number of human-specific SNPs.
- N_ALL_CRIT: Number of SNPs with an absolute value of the normalized iHS greater than two.
- PROP_ALL_CRIT: Proportion of SNPs with an absolute value of the normalized iHS greater than two.
- N_DEN_CRIT: Number of Denisovan-specific SNPs with an absolute value of the normalized iHS greater than two.
- PROP_DEN_CRIT: Proportion of Denisovan-specific SNPs with an absolute value of the normalized iHS greater than two.
- N_NEA_CRIT: Number of Neanderthal-specific SNPs with an absolute value of the normalized iHS greater than two.
- PROP_NEA_CRIT: Proportion of Neanderthal-specific SNPs with an absolute value of the normalized iHS greater than two.
- N_SHR_CRIT: Number of shared archaic SNPs with an absolute value of the normalized iHS greater than two.
- PROP_SHR_CRIT: Proportion of shared archaic SNPs with an absolute value of the normalized iHS greater than two.
- N_ARC_CRIT: Number of archaic SNPs with an absolute value of the normalized iHS greater than two.
- PROP_ARC_CRIT: Proportion of archaic SNPs with an absolute value of the normalized iHS greater than two.
- N_HOM_CRIT: Number of shared Hominin SNPs with an absolute value of the normalized iHS greater than two.
- PROP_HOM_CRIT: Proportion of shared Hominin SNPs with an absolute value of the normalized iHS greater than two.
- N_HUM_CRIT: Number of human-specific SNPs with an absolute value of the normalized iHS greater than two.
- PROP_HUM_CRIT: Proportion of human-specific SNPs with an absolute value of the normalized iHS greater than two.
introgression
archaic_site_patterns_windows.tar.gz
archaic_site_patterns_windows.tar.gz is the output from archaic_site_patterns_windows_v_revisions.py found here.
Paths
- archaic_site_patterns_windows.tar.gz
  - ./muc19/muc19_results/arcs_masked_aa/alt_{cha,vin}_den_chr{1..22}_72kb.txt.gz
alt_{cha,vin}_den_chr{1..22}_72kb.txt.gz
For each late Neanderthal ({cha,vin}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping 72kb window with the following columns:
- Column 1: Number of ABBA sites.
- Column 2: Number of BABA sites.
- Column 3: Number of BBAA sites.
- Column 4: Number of BAAA sites.
- Column 5: Number of ABAA sites.
- Column 6: Number of AABA sites.
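Site-pattern counts like these are commonly summarized with Patterson's D, which contrasts ABBA and BABA counts. The sketch below shows that standard formula applied per window and pooled across windows; this README does not state which statistics were derived from these files or how the populations map to pattern positions, so treat the function names and the choice of D as illustrative assumptions.

```python
def patterson_d(abba, baba):
    """Patterson's D = (ABBA - BABA) / (ABBA + BABA); NaN when both counts are zero."""
    denom = abba + baba
    if denom == 0:
        return float("nan")
    return (abba - baba) / denom


def pooled_d(windows):
    """Pool ABBA (column 1) and BABA (column 2) over a list of window rows."""
    total_abba = sum(row[0] for row in windows)
    total_baba = sum(row[1] for row in windows)
    return patterson_d(total_abba, total_baba)
```

Pooling counts before taking the ratio avoids giving sparse windows the same weight as windows with many informative sites.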
yri_na19664_archaic_site_patterns_windows.tar.gz
yri_na19664_archaic_site_patterns_windows.tar.gz is the output from mxl_archaic_site_patterns_windows_v_revisions.py found here.
Paths
- yri_na19664_archaic_site_patterns_windows.tar.gz
  - ./muc19/muc19_results/tgp_den_masked_aa/yri_na19664_den_chr{1..22}_{742,72}kb.txt.gz
  - ./muc19/muc19_results/tgp_alt_masked_aa/yri_na19664_alt_chr{1..22}_{742,72}kb.txt.gz
  - ./muc19/muc19_results/tgp_cha_masked_aa/yri_na19664_cha_chr{1..22}_{742,72}kb.txt.gz
  - ./muc19/muc19_results/tgp_vin_masked_aa/yri_na19664_vin_chr{1..22}_{742,72}kb.txt.gz
yri_na19664_{den,alt,cha,vin}_chr{1..22}_{742,72}kb.txt.gz
For each archaic individual ({den,alt,cha,vin}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping window ({742,72}kb) with the following columns:
- Column 1: Number of ABBA sites.
- Column 2: Number of BABA sites.
- Column 3: Number of BBAA sites.
- Column 4: Number of BAAA sites.
- Column 5: Number of ABAA sites.
- Column 6: Number of AABA sites.
tgp_q95_u30_afr_b_den_windows.tar.gz
tgp_q95_u30_afr_b_den_windows.tar.gz is the output from tgp_q95_u30_afr_b_den_windows_v_revisions.py found here.
Paths
- tgp_q95_u30_afr_b_den_windows.tar.gz
  - ./muc19/muc19_results/tgp_den_masked_no_aa/q95_u20_u30_afr_{non_afr_pop}_den_chr{1..22}_{742,72}kb.txt.gz
q95_u20_u30_afr_{non_afr_pop}_den_chr{1..22}_{742,72}kb.txt.gz
For each non-African population ({non_afr_pop}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping window ({742,72}kb) with the following columns:
- Column 1: The Q95 value.
- Column 2: The U20 value.
- Column 3: The U30 value.
tgp_u30_afr_b_den_windows.tar.gz
tgp_u30_afr_b_den_windows.tar.gz is the output from tgp_u30_afr_b_den_windows_v_revisions.py found here.
Paths
- tgp_u30_afr_b_den_windows.tar.gz
  - ./muc19/muc19_results/tgp_den_masked_no_aa/u30_afr_b_den_chr{1..22}_{742,72}kb.txt.gz
u30_afr_b_den_chr{1..22}_{742,72}kb.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. Each row corresponds to a non-overlapping window ({742,72}kb), and the columns correspond to the non-African population used to compute U30 in the following order: MXL, PEL, CLM, PUR, BEB, STU, ITU, PJL, GIH, CHB, KHV, CHS, JPT, CDX, TSI, CEU, IBS, GBR, FIN.
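Since these files carry no header, it can help to attach population labels to each row using the column order given above. The sketch below does this for a single row of values; the constant and function names are hypothetical.

```python
# Column order of the non-African populations, as listed in this README.
U30_POPULATIONS = [
    "MXL", "PEL", "CLM", "PUR",
    "BEB", "STU", "ITU", "PJL", "GIH",
    "CHB", "KHV", "CHS", "JPT", "CDX",
    "TSI", "CEU", "IBS", "GBR", "FIN",
]


def label_u30_row(values):
    """Map one row of U30 values onto population codes, checking the column count."""
    if len(values) != len(U30_POPULATIONS):
        raise ValueError(
            "expected %d columns, got %d" % (len(U30_POPULATIONS), len(values))
        )
    return dict(zip(U30_POPULATIONS, values))
```

The explicit length check guards against silently mislabeling columns if a row is parsed with the wrong delimiter.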
tgp_u30_afr_b_den_genes.tar.gz
tgp_u30_afr_b_den_genes.tar.gz is the output from tgp_u30_afr_b_den_genes_v_revisions.py found here.
Paths
- tgp_u30_afr_b_den_genes.tar.gz
  - ./muc19/muc19_results/tgp_den_masked_no_aa/u30_afr_b_den_per_gene_chr{1..22}.txt.gz
u30_afr_b_den_per_gene_chr{1..22}.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. Each row corresponds to an NCBI RefSeq Select gene, and the columns correspond to the non-African population used to compute U30 in the following order: MXL, PEL, CLM, PUR, BEB, STU, ITU, PJL, GIH, CHB, KHV, CHS, JPT, CDX, TSI, CEU, IBS, GBR, FIN.
pbs
amr_asn_eur_pbs_windows.tar.gz
amr_asn_eur_pbs_windows.tar.gz is the output from amr_asn_eur_pbs_windows_v_revisions.py found here.
Paths
- amr_asn_eur_pbs_windows.tar.gz
  - ./muc19/muc19_results/tgp_mod_no_aa/{amr_pop}_{asn_pop}_{eur_pop}_chr{1..22}_{742,72}kb.txt.gz
{amr_pop}_{asn_pop}_{eur_pop}_chr{1..22}_{742,72}kb.txt.gz
PBS(A, B, C) values for all unique combinations of A = AMR population ({amr_pop}), B = EAS/SAS population ({asn_pop}), and C = EUR population ({eur_pop}). There is one corresponding file per autosome (chr{1..22}). Each file contains a single row, where each column in that row represents the PBS(A,B,C) value for a non-overlapping window ({742,72}kb).
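For orientation, the population branch statistic is conventionally defined from pairwise F_ST values via the transformation T = -ln(1 - F_ST), with PBS(A, B, C) = (T_AB + T_AC - T_BC) / 2. The sketch below implements that textbook definition; the F_ST estimator and any handling of negative or undefined values in the original pipeline are not described here, so those details are left to the caller.

```python
import math


def branch_length(fst):
    """Log-transformed divergence time estimate: T = -ln(1 - Fst)."""
    return -math.log(1.0 - fst)


def pbs(fst_ab, fst_ac, fst_bc):
    """PBS for focal population A given pairwise Fst among A, B, and C."""
    t_ab = branch_length(fst_ab)
    t_ac = branch_length(fst_ac)
    t_bc = branch_length(fst_bc)
    return (t_ab + t_ac - t_bc) / 2.0
```

With A = an AMR population, B = an EAS/SAS population, and C = a EUR population, large values indicate allele-frequency change specific to the branch leading to A.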
mxl_chb_ceu_pbs_chromsome.tar.gz
mxl_chb_ceu_pbs_chromsome.tar.gz is the output from mxl_chb_ceu_pbs_chromsome_v_revisions.py found here.
Paths
- mxl_chb_ceu_pbs_chromsome.tar.gz
  - ./muc19/muc19_results/tgp_mod_no_aa/mxl_chb_ceu_pbs_partitions_chr{1..22}.txt.gz
mxl_chb_ceu_pbs_partitions_chr{1..22}.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. Each row corresponds to a genomic position and has the following columns:
- Column 1: PBS using all MXL individuals.
- Column 2: PBS using only MXL individuals with more than 50% Indigenous American ancestry.
- Column 3: PBS using only MXL individuals with less than 50% Indigenous American ancestry.
mxl_chb_ceu_pbs_windows.tar.gz
mxl_chb_ceu_pbs_windows.tar.gz is the output from mxl_chb_ceu_pbs_windows_v_revisions.py found here.
Paths
- mxl_chb_ceu_pbs_windows.tar.gz
  - ./muc19/muc19_results/tgp_mod_no_aa/mxl_chb_ceu_pbs_partitions_chr{1..22}_{742,72}kb.txt.gz
mxl_chb_ceu_pbs_partitions_chr{1..22}_{742,72}kb.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. Each row corresponds to a non-overlapping window ({742,72}kb) with the following columns:
- Column 1: PBS using all MXL individuals.
- Column 2: PBS using only MXL individuals with more than 50% Indigenous American ancestry.
- Column 3: PBS using only MXL individuals with less than 50% Indigenous American ancestry.
sprime_sites_mxl_chb_ceu_pbs_windows.tar.gz
sprime_sites_mxl_chb_ceu_pbs_windows.tar.gz is the output from sprime_sites_mxl_chb_ceu_pbs_windows_v_revisions.py found here.
Paths
- sprime_sites_mxl_chb_ceu_pbs_windows.tar.gz
  - ./muc19/muc19_results/tgp_mod_no_aa/mxl_chb_ceu_pbs_sprime_all_arc_chr{1..22}_{742,72}kb.txt.gz
mxl_chb_ceu_pbs_sprime_all_arc_chr{1..22}_{742,72}kb.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. Each row corresponds to a non-overlapping window ({742,72}kb) with the following columns:
- Column 1: PBS for SPrime sites in MXL that match with either the Altai Denisovan or Altai Neanderthal.
- Column 2: Number of SPrime sites in MXL that match with either the Altai Denisovan or Altai Neanderthal.
pbs_sims
{neutral,negative,positive_neutral_s1,positive_neutral_s01,positive_neutral_s0015}_recomb_map_slimulated_vcfs.tar.gz
These archives (listed below) contain the VCF files from the SLiMulations. The SLiM scripts can be found here.
- neutral_recomb_map_slimulated_vcfs.tar.gz is output from sim_sneutral_rmap.slim.
- negative_recomb_map_slimulated_vcfs.tar.gz is output from sim_snegative_rmap.slim.
- positive_neutral_s1_recomb_map_slimulated_vcfs.tar.gz is output from sim_spositive_neutral_1_rmap.slim.
- positive_neutral_s01_recomb_map_slimulated_vcfs.tar.gz is output from sim_spositive_neutral_01_rmap.slim.
- positive_neutral_s0015_recomb_map_slimulated_vcfs.tar.gz is output from sim_spositive_neutral_0015_rmap.slim.
Paths
- neutral_recomb_map_slimulated_vcfs.tar.gz
  - ./muc19/pbs_sims/smodel/neutral/recomb_map/VCF_sneutral_{sim_id}_{rep_id}.txt.gz
- negative_recomb_map_slimulated_vcfs.tar.gz
  - ./muc19/pbs_sims/smodel/negative/recomb_map/VCF_snegative_{sim_id}_{rep_id}.txt.gz
- positive_neutral_s1_recomb_map_slimulated_vcfs.tar.gz
  - ./muc19/pbs_sims/smodel/positive_neutral_1/VCF_spositive_neutral_{sim_id}_{rep_id}.txt
- positive_neutral_s01_recomb_map_slimulated_vcfs.tar.gz
  - ./muc19/pbs_sims/smodel/positive_neutral_01/VCF_spositive_neutral_{sim_id}_{rep_id}.txt
- positive_neutral_s0015_recomb_map_slimulated_vcfs.tar.gz
  - ./muc19/pbs_sims/smodel/positive_neutral_0015/VCF_spositive_neutral_{sim_id}_{rep_id}.txt
VCF_{sneutral,snegative,spositive_neutral}_{sim_id}_{rep_id}.{txt,txt.gz}
There are 10,000 simulated replicates ({sim_id}_{rep_id}), each as a bgzipped VCF file (.txt.gz), for the neutral ({sneutral}) and heterosis ({snegative}) SLiMulations. There are 1,000 simulated replicates, each as a VCF file (.txt), for the positive selection ({spositive_neutral}) SLiMulations. In each VCF file, the first 99 samples are simulated European individuals, the next 103 samples are simulated East Asian individuals, and the last 64 samples are simulated Mexican individuals. The VCF format specification can be found here.
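Because the simulated populations are identified only by sample position, a small helper that splits the VCF sample columns into the three groups can prevent off-by-one mistakes. The sketch below works on the #CHROM header line of a simulated VCF; the function name and the eur/eas/mxl keys are hypothetical labels for the groups described above.

```python
N_EUR, N_EAS, N_MXL = 99, 103, 64  # sample counts stated in this README


def split_simulated_samples(header_line):
    """Partition the sample names on a VCF #CHROM header line by position."""
    fields = header_line.rstrip("\n").split("\t")
    samples = fields[9:]  # the first nine VCF columns are fixed fields
    expected = N_EUR + N_EAS + N_MXL
    if len(samples) != expected:
        raise ValueError("expected %d samples, got %d" % (expected, len(samples)))
    return {
        "eur": samples[:N_EUR],
        "eas": samples[N_EUR:N_EUR + N_EAS],
        "mxl": samples[N_EUR + N_EAS:],
    }
```

The same index ranges (0-98, 99-201, 202-265) apply to the genotype columns of every record line once the nine fixed fields are dropped.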
mxl_slimulations.tar.gz
mxl_slimulations.tar.gz is the output from extract_neutral_negative_recomb_map_slimulated_results.py and extract_positive_slimulations_info.py found here.
Paths
- mxl_slimulations.tar.gz
  - ./muc19/muc19_results/mxl_slimulations/{neutral,negative}_per_10k_replicates.csv.gz
  - ./muc19/muc19_results/mxl_slimulations/{neutral,negative}_{all,arc}_snp_freqs_{742,72}kb_per_snp.txt.gz
  - ./muc19/muc19_results/mxl_slimulations/{neutral,negative}_pbs_{all,arc}_snps_{742,72}kb_per_snp.txt.gz
  - ./muc19/muc19_results/mxl_slimulations/positive_s{s1,s01,s0015}_per_1k_replicates.csv.gz
{neutral,negative}_per_10k_replicates.csv.gz
For each selection scenario ({neutral,negative}), there is one corresponding file. Each row corresponds to one simulated replicate, with the following columns:
- geq_all_snps_742kb: Number of SNPs in MXL segregating at a frequency greater than or equal to 0.3 in the 742kb region.
- pbs_all_snps_742kb: PBS(A=MXL,B=EAS,C=EUR) value for the 742kb region using all SNPs.
- prop_out_pbs_all_snps_742kb: Proportion of per-SNP PBS(A=MXL,B=EAS,C=EUR) values greater than the empirical 99.95th percentile for the 742kb region.
- geq_arc_snps_742kb: Number of archaic SNPs in MXL segregating at a frequency greater than or equal to 0.3 in the 742kb region.
- pbs_arc_snps_742kb: PBS(A=MXL,B=EAS,C=EUR) value for the 742kb region using archaic SNPs.
- prop_out_pbs_arc_snps_742kb: Proportion of per-archaic SNP PBS(A=MXL,B=EAS,C=EUR) values greater than the empirical 99.95th percentile for the 742kb region.
- geq_all_snps_72kb: Number of SNPs in MXL segregating at a frequency greater than or equal to 0.3 in the 72kb region.
- pbs_all_snps_72kb: PBS(A=MXL,B=EAS,C=EUR) value for the 72kb region using all SNPs.
- prop_out_pbs_all_snps_72kb: Proportion of per-SNP PBS(A=MXL,B=EAS,C=EUR) values greater than the empirical 99.95th percentile for the 72kb region.
- geq_arc_snps_72kb: Number of archaic SNPs in MXL segregating at a frequency greater than or equal to 0.3 in the 72kb region.
- pbs_arc_snps_72kb: PBS(A=MXL,B=EAS,C=EUR) value for the 72kb region using archaic SNPs.
- prop_out_pbs_arc_snps_72kb: Proportion of per-archaic SNP PBS(A=MXL,B=EAS,C=EUR) values greater than the empirical 99.95th percentile for the 72kb region.
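One way to use the per-replicate table is to ask how often a simulated statistic is at least as large as a value of interest. The sketch below assumes the file is comma-delimited with a header row containing the column names listed above; the function names and the `observed` argument are hypothetical and are not part of the deposited scripts.

```python
import csv
import gzip
import math


def load_replicates(path):
    """Read a per-replicate table (.csv.gz) into a list of dict rows.

    Assumes a header row whose names match the columns documented above.
    """
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))


def tail_proportion(rows, column, observed):
    """Proportion of replicates whose value in `column` is >= `observed`.

    Empty and NaN entries are dropped before the proportion is computed.
    """
    values = [float(row[column]) for row in rows if row[column] != ""]
    values = [v for v in values if not math.isnan(v)]
    if not values:
        raise ValueError(f"no usable values in column {column!r}")
    return sum(v >= observed for v in values) / len(values)
```

For example, `tail_proportion(load_replicates(path), "pbs_arc_snps_72kb", observed)` gives the fraction of simulated replicates at or above a user-supplied `observed` value; whether that matches the exact comparison made in the paper is not stated in this section.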
{neutral,negative}_{all,arc}_snp_freqs_{742,72}kb_per_snp.txt.gz
For each selection scenario ({neutral,negative}), there is one corresponding file for all ({all}) and archaic ({arc}) SNPs per focal region ({742,72}kb). Each file has one row, where the values represent the allele frequency for sites segregating in MXL across all 10,000 simulated replicates.
{neutral,negative}_pbs_{all,arc}_snps_{742,72}kb_per_snp.txt.gz
For each selection scenario ({neutral,negative}), there is one corresponding file for all ({all}) and archaic ({arc}) SNPs per focal region ({742,72}kb). Each file has one row, where the values represent the per-SNP PBS(A=MXL,B=EAS,C=EUR) values across all 10,000 simulated replicates.
positive_s{s1,s01,s0015}_per_1k_replicates.csv.gz
For each selection coefficient ({s1,s01,s0015}), there is one corresponding file. Each row corresponds to one simulated replicate, with the following columns:
- vcf_file: Corresponding VCF file.
- rng_seed: Corresponding SLiM RNG seed.
- org_rep_id: Arbitrary replicate ID.
- n_den_742kb: Number of Denisovan SNPs in the 742kb region.
- n_nea_742kb: Number of Neanderthal SNPs in the 742kb region.
- n_arc_742kb: Number of archaic SNPs in the 742kb region.
- n_den_72kb: Number of Denisovan SNPs in the 72kb region.
- n_nea_72kb: Number of Neanderthal SNPs in the 72kb region.
- n_arc_72kb: Number of archaic SNPs in the 72kb region.
- sel_pos: Position of the selected mutation.
- sel_origin: Population origin of the selected mutation.
- sel_mxl_daf: Frequency of the selected allele in the sampled MXL individuals.
- sel_eas_daf: Frequency of the selected allele in the sampled EAS individuals.
- sel_eur_daf: Frequency of the selected allele in the sampled EUR individuals.
- MXB_before: Frequency of the mutation in the MXB population before positive selection.
- EAS_before: Frequency of the mutation in the EAS population before positive selection.
- EUR_before: Frequency of the mutation in the EUR population before positive selection.
- MXB_after: Frequency of the selected allele in the MXB population at the end of the SLiMulation.
- MXL_after: Frequency of the selected allele in the MXL population at the end of the SLiMulation.
- EAS_after: Frequency of the selected allele in the EAS population at the end of the SLiMulation.
- EUR_after: Frequency of the selected allele in the EUR population at the end of the SLiMulation.
- n_seg_all_snps_742kb: Number of segregating sites in the 742kb region.
- pbs_all_snps_742kb: PBS(A=MXL,B=EAS,C=EUR) value for the 742kb region using all SNPs.
- fst_mxl_eas_all_snps_742kb: Fst(MXL,EAS) value for the 742kb region using all SNPs.
- fst_mxl_eur_all_snps_742kb: Fst(MXL,EUR) value for the 742kb region using all SNPs.
- fst_eas_eur_all_snps_742kb: Fst(EAS,EUR) value for the 742kb region using all SNPs.
- n_seg_arc_snps_742kb: Number of archaic SNPs segregating in the 742kb region.
- pbs_arc_snps_742kb: PBS(A=MXL,B=EAS,C=EUR) value for the 742kb region using archaic SNPs.
- fst_mxl_eas_arc_snps_742kb: Fst(MXL,EAS) value for the 742kb region using archaic SNPs.
- fst_mxl_eur_arc_snps_742kb: Fst(MXL,EUR) value for the 742kb region using archaic SNPs.
- fst_eas_eur_arc_snps_742kb: Fst(EAS,EUR) value for the 742kb region using archaic SNPs.
- n_seg_all_snps_72kb: Number of segregating sites in the 72kb region.
- pbs_all_snps_72kb: PBS(A=MXL,B=EAS,C=EUR) value for the 72kb region using all SNPs.
- fst_mxl_eas_all_snps_72kb: Fst(MXL,EAS) value for the 72kb region using all SNPs.
- fst_mxl_eur_all_snps_72kb: Fst(MXL,EUR) value for the 72kb region using all SNPs.
- fst_eas_eur_all_snps_72kb: Fst(EAS,EUR) value for the 72kb region using all SNPs.
- n_seg_arc_snps_72kb: Number of archaic SNPs segregating in the 72kb region.
- pbs_arc_snps_72kb: PBS(A=MXL,B=EAS,C=EUR) value for the 72kb region using archaic SNPs.
- fst_mxl_eas_arc_snps_72kb: Fst(MXL,EAS) value for the 72kb region using archaic SNPs.
- fst_mxl_eur_arc_snps_72kb: Fst(MXL,EUR) value for the 72kb region using archaic SNPs.
- fst_eas_eur_arc_snps_72kb: Fst(EAS,EUR) value for the 72kb region using archaic SNPs.
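The PBS and Fst columns are related through the standard definition of the population branch statistic: each pairwise Fst is log-transformed into a branch length, T = -ln(1 - Fst), and PBS for the focal population A is (T_AB + T_AC - T_BC) / 2. The sketch below applies that textbook formula to three Fst values. It is an illustration only: the Fst estimator and any handling of negative values used to produce the tabulated columns are not described in this section, so recomputed values may not match them exactly.

```python
import math


def branch_length(fst):
    """Log-transform an Fst value into a branch length, T = -ln(1 - Fst)."""
    if fst >= 1.0:
        raise ValueError("Fst must be below 1 for the log transform")
    return -math.log(1.0 - fst)


def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic for focal population A.

    With A=MXL, B=EAS, C=EUR the arguments correspond to
    Fst(MXL,EAS), Fst(MXL,EUR), and Fst(EAS,EUR).
    """
    t_ab = branch_length(fst_ab)
    t_ac = branch_length(fst_ac)
    t_bc = branch_length(fst_bc)
    return (t_ab + t_ac - t_bc) / 2.0
```

A row's `fst_mxl_eas_*`, `fst_mxl_eur_*`, and `fst_eas_eur_*` values can be passed in that order to compare against the matching `pbs_*` column.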
psuedo_ancestry_painting
archaic_psuedo_ancestry_painting_windows.tar.gz
archaic_psuedo_ancestry_painting_windows.tar.gz is the output from archaic_psuedo_ancestry_painting_windows_v_revisions.py found here.
Paths
archaic_psuedo_ancestry_painting_windows.tar.gz
- ./muc19/muc19_results/arcs_masked_no_aa/{cha,vin}_den_alt_pap_counts_chr{1..22}_72kb.txt.gz
{cha,vin}_den_alt_pap_counts_chr{1..22}_72kb.txt.gz
For each late Neanderthal ({cha,vin}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping 72kb window with the following columns:
- Column 1: Number of PAP sites.
- Column 2: Number of heterozygous sites.
tgp_archaic_psuedo_ancestry_painting_windows.tar.gz
tgp_archaic_psuedo_ancestry_painting_windows.tar.gz is the output from tgp_archaic_psuedo_ancestry_painting_windows_v_revisions.py found here.
Paths
tgp_archaic_psuedo_ancestry_painting_windows.tar.gz
- ./muc19/muc19_results/tgp_arcs_masked_no_aa/{den,alt,cha,vin}_mxl_yri_pap_counts_chr{1..22}_72kb.txt.gz
{den,alt,cha,vin}_mxl_yri_pap_counts_chr{1..22}_72kb.txt.gz
For each archaic individual ({den,alt,cha,vin}), there is one corresponding file per autosome (chr{1..22}). Each row corresponds to a non-overlapping 72kb window with the following columns:
- Column 1: Number of PAP sites.
- Column 2: Number of heterozygous sites.
sequence_divergence
den_v_alt_divergence_windows.tar.gz
den_v_alt_divergence_windows.tar.gz is the output from arc_dip_v_arc_dip_divergence_windows_v_revisions.py found here.
Paths
den_v_alt_divergence_windows.tar.gz
- ./muc19/muc19_results/arcs_masked_no_aa/den_v_alt_pw_diffs_chr{1..22}_72kb.txt.gz
den_v_alt_pw_diffs_chr{1..22}_72kb.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. This file contains a single row, where each column in that row corresponds to the average number of pairwise differences between the Altai Denisovan's and Altai Neanderthal's two chromosomes per non-overlapping 72kb window.
sgdp_denisovan_sequence_divergence_at_denisovan_intro_tracts_in_papuans.tar.gz
sgdp_denisovan_sequence_divergence_at_denisovan_intro_tracts_in_papuans.tar.gz is the output from sgdp_denisovan_sequence_divergence_at_denisovan_intro_tracts_in_papuans_v_revisions.py found here.
Paths
sgdp_denisovan_sequence_divergence_at_denisovan_intro_tracts_in_papuans.tar.gz
- ./muc19/muc19_results/sgdp_den_masked_no_aa/{pap_ind}_hap{1,2}_den_pw_diffs_seq_div_at_den_intro_tracts_chr{1..22}.txt.gz
{pap_ind}_hap{1,2}_den_pw_diffs_seq_div_at_den_intro_tracts_chr{1..22}.txt.gz
For each Papuan individual ({pap_ind}), there is one corresponding file per autosome (chr{1..22}) for each haploid genome (hap{1,2}). Each row corresponds to a Denisovan introgressed tract for the corresponding haploid genome (hap{1,2}) in the respective Papuan individual ({pap_ind}) with the following columns:
- Column 1: The average number of pairwise differences between the Papuan individual's haplotype and the Altai Denisovan's two chromosomes.
- Column 2: The corresponding sequence divergence.
tgp_hap_v_arc_dip_divergence_windows.tar.gz
tgp_hap_v_arc_dip_divergence_windows.tar.gz is the output from tgp_hap_v_arc_dip_divergence_windows_v_revisions.py found here.
Paths
tgp_hap_v_arc_dip_divergence_windows.tar.gz
- ./muc19/muc19_results/tgp_den_masked_no_aa/den_hap_{1,2}_pw_diffs_chr{1..22}_{742,72}kb.txt.gz
- ./muc19/muc19_results/tgp_alt_masked_no_aa/alt_hap_{1,2}_pw_diffs_chr{1..22}_{742,72}kb.txt.gz
- ./muc19/muc19_results/tgp_cha_masked_no_aa/cha_hap_{1,2}_pw_diffs_chr{1..22}_{742,72}kb.txt.gz
- ./muc19/muc19_results/tgp_vin_masked_no_aa/vin_hap_{1,2}_pw_diffs_chr{1..22}_{742,72}kb.txt.gz
{den,alt,cha,vin}_hap_{1,2}_pw_diffs_chr{1..22}_{742,72}kb.txt.gz
For each archaic individual ({den,alt,cha,vin}), there is one corresponding file per autosome (chr{1..22}) for each haploid genome (hap_{1,2}). Each row corresponds to a non-overlapping window ({742,72}kb). The columns correspond to the average number of pairwise differences between a 1000 Genomes Project individual's haplotype (hap_{1,2}) and the archaic individual's ({den,alt,cha,vin}) two chromosomes, in the same order as individuals appear in ./muc19/meta_data/tgp_mod.txt.
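To make the quantity in these files concrete: the average number of pairwise differences between one haplotype and a diploid individual's two chromosomes adds 0, 0.5, or 1 per site, depending on whether the haplotype allele differs from neither, one, or both of the diploid's alleles. The sketch below shows that calculation for biallelic sites. It is not the deposited script; the function name and the use of -1 for missing calls are assumptions made for illustration.

```python
def hap_vs_diploid_diffs(hap_alleles, dip_genotypes):
    """Average pairwise differences between a haplotype and a diploid genotype.

    hap_alleles:   sequence of 0/1 allele codes for one haplotype.
    dip_genotypes: sequence of (allele_1, allele_2) pairs for the diploid.
    Missing calls are assumed to be coded as -1 and are skipped.
    """
    if len(hap_alleles) != len(dip_genotypes):
        raise ValueError("inputs must cover the same sites")
    total = 0.0
    for hap, (a1, a2) in zip(hap_alleles, dip_genotypes):
        if hap < 0 or a1 < 0 or a2 < 0:
            continue  # skip sites with any missing call
        # Compare the haplotype to each diploid chromosome and average.
        total += ((hap != a1) + (hap != a2)) / 2.0
    return total
```

Summing this quantity over the sites in a window gives one entry of the kind stored per window and per haplotype in the files described above.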
tgp_hap_v_arc_psuedo_hap_divergence_windows.tar.gz
tgp_hap_v_arc_psuedo_hap_divergence_windows.tar.gz is the output from tgp_hap_v_arc_psuedo_hap_divergence_windows_v_revisions.py found here.
Paths
tgp_hap_v_arc_psuedo_hap_divergence_windows.tar.gz
- ./muc19/muc19_results/tgp_cha_masked_no_aa/cha_hap_{1,2}_psuedo_hap_diffs_chr{1..22}_72kb.txt.gz
- ./muc19/muc19_results/tgp_vin_masked_no_aa/vin_hap_{1,2}_psuedo_hap_diffs_chr{1..22}_72kb.txt.gz
{cha,vin}_hap_{1,2}_psuedo_hap_diffs_chr{1..22}_72kb.txt.gz
For each late Neanderthal ({cha,vin}), there is one corresponding file per autosome (chr{1..22}) for each haploid genome (hap_{1,2}). Each row corresponds to a non-overlapping 72kb window. The columns correspond to the average number of pairwise differences between a 1000 Genomes Project individual's haplotype (hap_{1,2}) and the late Neanderthal's ({cha,vin}) pseudo-haplotype, in the same order as individuals appear in ./muc19/meta_data/tgp_mod.txt.
afr_v_den_divergence_windows.tar.gz
afr_v_den_divergence_windows.tar.gz is the output from tgp_spop_v_arc_dip_divergence_windows_v_revisions.py found here.
Paths
afr_v_den_divergence_windows.tar.gz
- ./muc19/muc19_results/tgp_den_masked_no_aa/afr_den_avg_pw_diffs_chr{1..22}_72kb.txt.gz
afr_den_avg_pw_diffs_chr{1..22}_72kb.txt.gz
For each autosome (chr{1..22}), there is one corresponding file. This file contains one row, where each column corresponds to the average number of pairwise differences between all African chromosomes in the 1000 Genomes Project and the Altai Denisovan's two chromosomes per non-overlapping 72kb window.
vcf_data
Since ALL of the VCF files and bookkeeping information are well over 3TB, we provide the converted Zarr arrays, which are the output of vcf_to_zarr_v_revisions.py found here, as 99.99% of analyses are performed using these data. Additionally, in the following windowing subsubsection, we provide the corresponding bookkeeping files used in our analyses and include the raw bookkeeping files for the sake of completeness, which are located in bookkeeping.tar.gz. Note that all VCF files are formatted in accordance with the VCF specification, and the format of the raw bookkeeping files is described in the VCF processing scripts found here. The {prefix}.tar.gz files are named by the dataset ({prefix}) and, unless otherwise noted, contain the corresponding Zarr arrays for all autosomes (chr{1..22}).
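The `chr{1..22}` notation used throughout this section stands for one Zarr array per autosome. A small helper can expand a dataset prefix into the 22 expected paths and report any that are missing after extraction. This is a sketch: the function names are hypothetical, and the default root follows the `./muc19/zarr_data/` paths listed in this README.

```python
from pathlib import Path

AUTOSOMES = range(1, 23)  # chr1 through chr22


def zarr_paths(prefix, root="./muc19/zarr_data"):
    """Expand a dataset prefix into its per-autosome Zarr paths.

    Example: prefix "tgp_mod_no_aa" -> .../tgp_mod_no_aa_chr1.zarr, ...
    """
    return [Path(root) / f"{prefix}_chr{chrom}.zarr" for chrom in AUTOSOMES]


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [path for path in paths if not path.exists()]
```

Running `missing_paths(zarr_paths("tgp_mod_no_aa"))` after unpacking an archive is a quick completeness check. Datasets documented as a single Zarr array do not follow the per-autosome pattern and should be checked directly.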
Datasets & Paths
arcs_masked_no_aa.tar.gz
- All archaic individuals without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
- ./muc19/zarr_data/arcs_masked_no_aa_chr{1..22}.zarr
arcs_masked_aa.tar.gz
- All archaic individuals with ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
- ./muc19/zarr_data/arcs_masked_aa_chr{1..22}.zarr
{den,alt,cha,vin}_masked_no_aa.tar.gz
- den_masked_no_aa.tar.gz: Altai Denisovan without ancestral allele calls.
  - ./muc19/zarr_data/den_masked_no_aa_chr{1..22}.zarr
- alt_masked_no_aa.tar.gz: Altai Neanderthal without ancestral allele calls.
  - ./muc19/zarr_data/alt_masked_no_aa_chr{1..22}.zarr
- cha_masked_no_aa.tar.gz: Chagyrskaya Neanderthal without ancestral allele calls.
  - ./muc19/zarr_data/cha_masked_no_aa_chr{1..22}.zarr
- vin_masked_no_aa.tar.gz: Vindija Neanderthal without ancestral allele calls.
  - ./muc19/zarr_data/vin_masked_no_aa_chr{1..22}.zarr
tgp_arcs_masked_no_aa.tar.gz
- 1000 Genomes Project and all archaic individuals without ancestral allele calls. Individuals are in the following order: 1000 Genomes Project individuals (order as in ./muc19/meta_data/tgp_mod.txt), Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
- ./muc19/zarr_data/tgp_arcs_masked_no_aa_chr{1..22}.zarr
tgp_{den,alt,cha,vin}_masked_no_aa.tar.gz
- tgp_den_masked_no_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt) and the Altai Denisovan, without ancestral allele calls.
  - ./muc19/zarr_data/tgp_den_masked_no_aa_chr{1..22}.zarr
- tgp_alt_masked_no_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt) and the Altai Neanderthal, without ancestral allele calls.
  - ./muc19/zarr_data/tgp_alt_masked_no_aa_chr{1..22}.zarr
- tgp_cha_masked_no_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt) and the Chagyrskaya Neanderthal, without ancestral allele calls.
  - ./muc19/zarr_data/tgp_cha_masked_no_aa_chr{1..22}.zarr
- tgp_vin_masked_no_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt) and the Vindija Neanderthal, without ancestral allele calls.
  - ./muc19/zarr_data/tgp_vin_masked_no_aa_chr{1..22}.zarr
tgp_{den,alt,cha,vin}_masked_aa.tar.gz
- tgp_den_masked_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt), the Altai Denisovan, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
  - ./muc19/zarr_data/tgp_den_masked_aa_chr{1..22}.zarr
- tgp_alt_masked_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt), the Altai Neanderthal, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
  - ./muc19/zarr_data/tgp_alt_masked_aa_chr{1..22}.zarr
- tgp_cha_masked_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt), the Chagyrskaya Neanderthal, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
  - ./muc19/zarr_data/tgp_cha_masked_aa_chr{1..22}.zarr
- tgp_vin_masked_aa.tar.gz: 1000 Genomes Project (order as in ./muc19/meta_data/tgp_mod.txt), the Vindija Neanderthal, and a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
  - ./muc19/zarr_data/tgp_vin_masked_aa_chr{1..22}.zarr
sgdp_arcs_masked_no_aa.tar.gz
- Simons Genome Diversity Project and all archaic individuals without ancestral allele calls. Individuals are in the following order: Simons Genome Diversity Project individuals (order as in ./muc19/meta_data/sgdp.txt), Altai Neanderthal, Chagyrskaya Neanderthal, Vindija Neanderthal, Altai Denisovan.
- ./muc19/zarr_data/sgdp_arcs_masked_no_aa_chr{1..22}.zarr
sgdp_{den,alt,cha,vin}_masked_no_aa.tar.gz
- sgdp_den_masked_no_aa.tar.gz: Simons Genome Diversity Project (order as in ./muc19/meta_data/sgdp.txt) and the Altai Denisovan, without ancestral allele calls.
  - ./muc19/zarr_data/sgdp_den_masked_no_aa_chr{1..22}.zarr
- sgdp_alt_masked_no_aa.tar.gz: Simons Genome Diversity Project (order as in ./muc19/meta_data/sgdp.txt) and the Altai Neanderthal, without ancestral allele calls.
  - ./muc19/zarr_data/sgdp_alt_masked_no_aa_chr{1..22}.zarr
- sgdp_cha_masked_no_aa.tar.gz: Simons Genome Diversity Project (order as in ./muc19/meta_data/sgdp.txt) and the Chagyrskaya Neanderthal, without ancestral allele calls.
  - ./muc19/zarr_data/sgdp_cha_masked_no_aa_chr{1..22}.zarr
- sgdp_vin_masked_no_aa.tar.gz: Simons Genome Diversity Project (order as in ./muc19/meta_data/sgdp.txt) and the Vindija Neanderthal, without ancestral allele calls.
  - ./muc19/zarr_data/sgdp_vin_masked_no_aa_chr{1..22}.zarr
tgp_mod_no_aa.tar.gz
- 1000 Genomes Project without ancestral allele calls. Individuals are in the same order as they appear in ./muc19/meta_data/tgp_mod.txt.
- ./muc19/zarr_data/tgp_mod_no_aa_chr{1..22}.zarr
tgp_mod_aa.tar.gz
- 1000 Genomes Project with ancestral allele calls. Individuals are in the same order as they appear in ./muc19/meta_data/tgp_mod.txt, followed by a placeholder individual "Ancestor" who is always homozygous for the ancestral allele.
- ./muc19/zarr_data/tgp_mod_aa_chr{1..22}.zarr
cha_phased_ref_panel_all_inds.tar.gz
- Corresponding output from BEAGLE before read-based phasing for the focal 72kb region, without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Chagyrskaya Neanderthal, Altai Denisovan.
- ./muc19/zarr_data/cha_phased_ref_panel_all_inds.zarr
  - Note that this is a single Zarr array for the focal 72kb region.
  - The input VCF file can be found here.
vin_phased_ref_panel_all_inds.tar.gz
- Corresponding output from BEAGLE before read-based phasing for the focal 72kb region, without ancestral allele calls. Individuals are in the following order: Altai Neanderthal, Vindija Neanderthal, Altai Denisovan.
- ./muc19/zarr_data/vin_phased_ref_panel_all_inds.zarr
  - Note that this is a single Zarr array for the focal 72kb region.
  - The input VCF file can be found here.
windowing
arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz
arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/arcs_masked_no_aa/72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/arcs_masked_no_aa/72kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/arcs_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) 72kb non-overlapping windows, with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- ALT: Altai Neanderthal effective sequence length.
- CHA: Chagyrskaya Neanderthal effective sequence length.
- VIN: Vindija Neanderthal effective sequence length.
- DEN: Altai Denisovan effective sequence length.
- DEN-ALT: Effective sequence length for Altai Denisovan-Altai Neanderthal comparisons.
- DEN-CHA: Effective sequence length for Altai Denisovan-Chagyrskaya Neanderthal comparisons.
- DEN-VIN: Effective sequence length for Altai Denisovan-Vindija Neanderthal comparisons.
- ALT-CHA: Effective sequence length for Altai Neanderthal-Chagyrskaya Neanderthal comparisons.
- ALT-VIN: Effective sequence length for Altai Neanderthal-Vindija Neanderthal comparisons.
- CHA-VIN: Effective sequence length for Chagyrskaya Neanderthal-Vindija Neanderthal comparisons.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
72kb_esl_qced_nonoverlapping_variant_windows.txt.gz
Indices of 72kb variant windows of comparable effective sequence length, used for subsetting 72kb_nonoverlapping_variant_windows.csv.gz.
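Subsetting the window table with the QC'ed index file might look like the following. This sketch makes two assumptions that are not stated in this README: the index file is whitespace- or newline-delimited, and each index is a 0-based row position in the window table. If the indices instead refer to the IDX column, the lookup would need to change accordingly. The function names are hypothetical.

```python
import csv
import gzip


def load_windows(path):
    """Read a *_windows.csv.gz table into a list of dict rows (one per window)."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))


def load_indices(path):
    """Read window indices from a *_windows.txt.gz file.

    Values are parsed via float first so that entries written in
    scientific notation (e.g. 2.000e+00) are also accepted.
    """
    with gzip.open(path, "rt") as handle:
        return [int(float(token)) for token in handle.read().split()]


def subset_windows(rows, indices):
    """Keep only the windows at the given 0-based row positions (assumed)."""
    return [rows[i] for i in indices]
```

The result of `subset_windows(load_windows(csv_path), load_indices(idx_path))` is the list of windows of comparable effective sequence length, with every column of the original table preserved.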
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with the following columns:
- Column 1: Altai Neanderthal effective sequence length.
- Column 2: Chagyrskaya Neanderthal effective sequence length.
- Column 3: Vindija Neanderthal effective sequence length.
- Column 4: Altai Denisovan effective sequence length.
- Column 5: Effective sequence length for Altai Denisovan-Altai Neanderthal comparisons.
- Column 6: Effective sequence length for Altai Denisovan-Chagyrskaya Neanderthal comparisons.
- Column 7: Effective sequence length for Altai Denisovan-Vindija Neanderthal comparisons.
- Column 8: Effective sequence length for Altai Neanderthal-Chagyrskaya Neanderthal comparisons.
- Column 9: Effective sequence length for Altai Neanderthal-Vindija Neanderthal comparisons.
- Column 10: Effective sequence length for Chagyrskaya Neanderthal-Vindija Neanderthal comparisons.
arcs_masked_aa_window_info_and_eff_seq_len.tar.gz
arcs_masked_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
arcs_masked_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/arcs_masked_aa/72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/arcs_masked_aa/72kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/arcs_masked_aa/{742,72}kb_eff_seq_len.txt.gz
72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) 72kb non-overlapping windows, with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- ALT: Altai Neanderthal effective sequence length.
- CHA: Chagyrskaya Neanderthal effective sequence length.
- VIN: Vindija Neanderthal effective sequence length.
- DEN: Altai Denisovan effective sequence length.
- DEN-ALT: Effective sequence length for Altai Denisovan-Altai Neanderthal comparisons.
- DEN-CHA: Effective sequence length for Altai Denisovan-Chagyrskaya Neanderthal comparisons.
- DEN-VIN: Effective sequence length for Altai Denisovan-Vindija Neanderthal comparisons.
- ALT-CHA: Effective sequence length for Altai Neanderthal-Chagyrskaya Neanderthal comparisons.
- ALT-VIN: Effective sequence length for Altai Neanderthal-Vindija Neanderthal comparisons.
- CHA-VIN: Effective sequence length for Chagyrskaya Neanderthal-Vindija Neanderthal comparisons.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
72kb_esl_qced_nonoverlapping_variant_windows.txt.gz
Indices of 72kb variant windows of comparable effective sequence length, used for subsetting 72kb_nonoverlapping_variant_windows.csv.gz.
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with the following columns:
- Column 1: Altai Neanderthal effective sequence length.
- Column 2: Chagyrskaya Neanderthal effective sequence length.
- Column 3: Vindija Neanderthal effective sequence length.
- Column 4: Altai Denisovan effective sequence length.
- Column 5: Effective sequence length for Altai Denisovan-Altai Neanderthal comparisons.
- Column 6: Effective sequence length for Altai Denisovan-Chagyrskaya Neanderthal comparisons.
- Column 7: Effective sequence length for Altai Denisovan-Vindija Neanderthal comparisons.
- Column 8: Effective sequence length for Altai Neanderthal-Chagyrskaya Neanderthal comparisons.
- Column 9: Effective sequence length for Altai Neanderthal-Vindija Neanderthal comparisons.
- Column 10: Effective sequence length for Chagyrskaya Neanderthal-Vindija Neanderthal comparisons.
{den,alt,cha,vin}_masked_no_aa_window_info_and_eff_seq_len.tar.gz
{den,alt,cha,vin}_masked_no_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
den_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/den_masked_no_aa/72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/den_masked_no_aa/72kb_esl_qced_nonoverlapping_{invariant,variant}_windows.txt.gz
- ./muc19/windowing/den_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
alt_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/alt_masked_no_aa/72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/alt_masked_no_aa/72kb_esl_qced_nonoverlapping_{invariant,variant}_windows.txt.gz
- ./muc19/windowing/alt_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
cha_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/cha_masked_no_aa/72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/cha_masked_no_aa/72kb_esl_qced_nonoverlapping_{invariant,variant}_windows.txt.gz
- ./muc19/windowing/cha_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
vin_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/vin_masked_no_aa/72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/vin_masked_no_aa/72kb_esl_qced_nonoverlapping_{invariant,variant}_windows.txt.gz
- ./muc19/windowing/vin_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
72kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) 72kb non-overlapping windows, with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- {ARC}: Effective sequence length with respect to the given archaic individual (e.g., DEN for Altai Denisovan).
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
72kb_esl_qced_nonoverlapping_{invariant,variant}_windows.txt.gz
Indices of non-overlapping 72kb invariant (_invariant_) and variant (_variant_) windows of comparable effective sequence length, used for subsetting the corresponding 72kb_nonoverlapping_{invariant,variant}_windows.csv.gz file.
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with respect to the given archaic individual (a single value per file).
tgp_arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz
tgp_arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
tgp_arcs_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/tgp_arcs_masked_no_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/tgp_arcs_masked_no_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/tgp_arcs_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) per window size ({742,72}kb), with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- ALT: Effective sequence length for Altai Neanderthal comparisons.
- CHA: Effective sequence length for Chagyrskaya Neanderthal comparisons.
- VIN: Effective sequence length for Vindija Neanderthal comparisons.
- DEN: Effective sequence length for Altai Denisovan comparisons.
- DEN-ALT: Effective sequence length for Altai Denisovan-Altai Neanderthal comparisons.
- DEN-CHA: Effective sequence length for Altai Denisovan-Chagyrskaya Neanderthal comparisons.
- DEN-VIN: Effective sequence length for Altai Denisovan-Vindija Neanderthal comparisons.
- ALT-CHA: Effective sequence length for Altai Neanderthal-Chagyrskaya Neanderthal comparisons.
- ALT-VIN: Effective sequence length for Altai Neanderthal-Vindija Neanderthal comparisons.
- CHA-VIN: Effective sequence length for Chagyrskaya Neanderthal-Vindija Neanderthal comparisons.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
Indices of variant windows per window size ({742,72}kb) of comparable effective sequence length, used for subsetting {742,72}kb_nonoverlapping_variant_windows.csv.gz.
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with the following columns:
- Column 1: Effective sequence length for Altai Neanderthal comparisons.
- Column 2: Effective sequence length for Chagyrskaya Neanderthal comparisons.
- Column 3: Effective sequence length for Vindija Neanderthal comparisons.
- Column 4: Effective sequence length for Altai Denisovan comparisons.
- Column 5: Effective sequence length for Altai Denisovan-Altai Neanderthal comparisons.
- Column 6: Effective sequence length for Altai Denisovan-Chagyrskaya Neanderthal comparisons.
- Column 7: Effective sequence length for Altai Denisovan-Vindija Neanderthal comparisons.
- Column 8: Effective sequence length for Altai Neanderthal-Chagyrskaya Neanderthal comparisons.
- Column 9: Effective sequence length for Altai Neanderthal-Vindija Neanderthal comparisons.
- Column 10: Effective sequence length for Chagyrskaya Neanderthal-Vindija Neanderthal comparisons.
tgp_{den,alt,cha,vin}_masked_no_aa_window_info_and_eff_seq_len.tar.gz
tgp_{den,alt,cha,vin}_masked_no_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
tgp_den_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/tgp_den_masked_no_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/tgp_den_masked_no_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/tgp_den_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
tgp_alt_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/tgp_alt_masked_no_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/tgp_alt_masked_no_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/tgp_alt_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
tgp_cha_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/tgp_cha_masked_no_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/tgp_cha_masked_no_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/tgp_cha_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
tgp_vin_masked_no_aa_window_info_and_eff_seq_len.tar.gz
- ./muc19/windowing/tgp_vin_masked_no_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
- ./muc19/windowing/tgp_vin_masked_no_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
- ./muc19/windowing/tgp_vin_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) per window size ({742,72}kb), with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- {ARC}: Effective sequence length with respect to the given archaic individual (e.g., DEN for Altai Denisovan).
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
Indices of variant windows per window size ({742,72}kb) of comparable effective sequence length, used for subsetting {742,72}kb_nonoverlapping_variant_windows.csv.gz.
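As an illustration of the subsetting step, here is a minimal sketch that uses only the Python standard library. It rests on two assumptions that this README does not confirm: the index file holds whitespace-separated integers, and those integers are zero-based row positions in the variant window table (rather than values of the IDX column). The function names are hypothetical, and the demonstration runs on made-up values, not on the released data.

```python
import csv
import gzip
import os
import tempfile


def load_windows(csv_gz_path):
    """Read a gzipped window CSV into a list of dicts keyed by column header."""
    with gzip.open(csv_gz_path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))


def load_indices(txt_gz_path):
    """Read whitespace-separated indices; tolerate float-formatted integers."""
    with gzip.open(txt_gz_path, "rt") as handle:
        return [int(float(token)) for token in handle.read().split()]


def subset_windows(windows, indices):
    """Keep rows at the given zero-based row positions (assumed convention)."""
    return [windows[i] for i in indices]


# Synthetic demonstration with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    win_path = os.path.join(tmp, "72kb_nonoverlapping_variant_windows.csv.gz")
    idx_path = os.path.join(tmp, "72kb_esl_qced_nonoverlapping_variant_windows.txt.gz")
    with gzip.open(win_path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["IDX", "CHR", "START", "STOP", "DEN", "S", "QC"])
        writer.writerow([0, 1, 1, 72000, 65000, 120, 1])
        writer.writerow([1, 1, 72001, 144000, 1000, 3, 0])
        writer.writerow([2, 1, 144001, 216000, 70000, 98, 1])
    with gzip.open(idx_path, "wt") as fh:
        fh.write("0\n2\n")
    kept = subset_windows(load_windows(win_path), load_indices(idx_path))

print([row["IDX"] for row in kept])
```

If the indices turn out to refer to the IDX column instead, `subset_windows` would need to match on `row["IDX"]` rather than on list position.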
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with respect to the given archaic individual; each file contains a single value.
tgp_{den,alt,cha,vin}_masked_aa_window_info_and_eff_seq_len.tar.gz
tgp_{den,alt,cha,vin}_masked_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
- tgp_den_masked_aa_window_info_and_eff_seq_len.tar.gz
  - ./muc19/windowing/tgp_den_masked_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
  - ./muc19/windowing/tgp_den_masked_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
  - ./muc19/windowing/tgp_den_masked_aa/{742,72}kb_eff_seq_len.txt.gz
- tgp_alt_masked_aa_window_info_and_eff_seq_len.tar.gz
  - ./muc19/windowing/tgp_alt_masked_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
  - ./muc19/windowing/tgp_alt_masked_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
  - ./muc19/windowing/tgp_alt_masked_aa/{742,72}kb_eff_seq_len.txt.gz
- tgp_cha_masked_aa_window_info_and_eff_seq_len.tar.gz
  - ./muc19/windowing/tgp_cha_masked_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
  - ./muc19/windowing/tgp_cha_masked_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
  - ./muc19/windowing/tgp_cha_masked_aa/{742,72}kb_eff_seq_len.txt.gz
- tgp_vin_masked_aa_window_info_and_eff_seq_len.tar.gz
  - ./muc19/windowing/tgp_vin_masked_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
  - ./muc19/windowing/tgp_vin_masked_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
  - ./muc19/windowing/tgp_vin_masked_aa/{742,72}kb_eff_seq_len.txt.gz
{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) per window size ({742,72}kb), with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- {ARC}: Effective sequence length with respect to the given archaic individual (e.g., DEN for Altai Denisovan).
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
Indices of variant windows per window size ({742,72}kb) of comparable effective sequence length, used for subsetting {742,72}kb_nonoverlapping_variant_windows.csv.gz.
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with respect to the given archaic individual; each file contains a single value.
sgdp_{den,alt,cha,vin}_masked_no_aa_eff_seq_len.tar.gz
sgdp_{den,alt,cha,vin}_masked_no_aa_eff_seq_len.tar.gz is the output from compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
- sgdp_den_masked_no_aa_eff_seq_len.tar.gz
  - ./muc19/windowing/sgdp_den_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
- sgdp_alt_masked_no_aa_eff_seq_len.tar.gz
  - ./muc19/windowing/sgdp_alt_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
- sgdp_cha_masked_no_aa_eff_seq_len.tar.gz
  - ./muc19/windowing/sgdp_cha_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
- sgdp_vin_masked_no_aa_eff_seq_len.tar.gz
  - ./muc19/windowing/sgdp_vin_masked_no_aa/{742,72}kb_eff_seq_len.txt.gz
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with respect to the given archaic individual; each file contains a single value.
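Because each of these files stores one number, the values for the four archaic individuals can be collected into a single lookup. The sketch below assumes that the archives have been extracted under a common root directory so that the paths listed above exist, and that each file holds one numeric token. The function names are hypothetical, and the demonstration writes made-up values into a temporary directory instead of reading the released files.

```python
import gzip
import os
import tempfile

ARCHAICS = ("den", "alt", "cha", "vin")


def read_single_value(path):
    """Return the one numeric value stored in a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        tokens = handle.read().split()
    if len(tokens) != 1:
        raise ValueError(f"expected one value in {path}, found {len(tokens)}")
    return float(tokens[0])


def sgdp_eff_seq_lens(root, window_kb):
    """Map each archaic label to its focal-region effective sequence length."""
    lengths = {}
    for arc in ARCHAICS:
        path = os.path.join(
            root, "muc19", "windowing", f"sgdp_{arc}_masked_no_aa",
            f"{window_kb}kb_eff_seq_len.txt.gz",
        )
        lengths[arc] = read_single_value(path)
    return lengths


# Synthetic demonstration with made-up values (not taken from the data set).
with tempfile.TemporaryDirectory() as root:
    for offset, arc in enumerate(ARCHAICS):
        folder = os.path.join(root, "muc19", "windowing", f"sgdp_{arc}_masked_no_aa")
        os.makedirs(folder)
        with gzip.open(os.path.join(folder, "72kb_eff_seq_len.txt.gz"), "wt") as fh:
            fh.write(f"{60000 + offset}\n")
    demo_lengths = sgdp_eff_seq_lens(root, 72)

print(demo_lengths)
```

The length check in `read_single_value` raises an error if a file contains more or fewer than one value, so a mismatch with the assumed layout is reported rather than silently ignored.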
tgp_mod_no_aa_window_info_and_eff_seq_len.tar.gz
tgp_mod_no_aa_window_info_and_eff_seq_len.tar.gz is the output from consolidate_all_archaics_tgp_sgdp_windows_v_revisions.py and compute_region_effective_sequence_lengths_v_revisions.py found here.
Paths
- tgp_mod_no_aa_window_info_and_eff_seq_len.tar.gz
  - ./muc19/windowing/tgp_mod_no_aa/{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
  - ./muc19/windowing/tgp_mod_no_aa/{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
  - ./muc19/windowing/tgp_mod_no_aa/{742,72}kb_eff_seq_len.txt.gz
{742,72}kb_nonoverlapping_{invariant,variant}_windows.csv.gz
One file for invariant (_invariant_) and one for variant (_variant_) per window size ({742,72}kb), with the following columns:
- IDX: Window index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- HUM: Effective sequence length.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
Note: If only column headers are present in the invariant file (_invariant_), no invariant windows passed initial QC.
{742,72}kb_esl_qced_nonoverlapping_variant_windows.txt.gz
Indices of variant windows per window size ({742,72}kb) of comparable effective sequence length, used for subsetting {742,72}kb_nonoverlapping_variant_windows.csv.gz.
{742,72}kb_eff_seq_len.txt.gz
Effective sequence lengths for the focal regions ({742,72}kb) with respect to the given archaic individual; each file contains a single value.
sgdp_denisovan_intro_tracts_in_papuans_qc_info.tar.gz
sgdp_denisovan_intro_tracts_in_papuans_qc_info.tar.gz is the output from consolidate_sgdp_denisovan_intro_tracts_in_papuans_v_revisions.py found here.
Paths
- sgdp_denisovan_intro_tracts_in_papuans_qc_info.tar.gz
  - ./muc19/windowing/sgdp_den_masked_no_aa/{pap_ind}_hap{1,2}_den_intro_{invariant,variant}_tracts.csv.gz
{pap_ind}_hap{1,2}_den_intro_{invariant,variant}_tracts.csv.gz
For each Papuan individual ({pap_ind}), there is one file for invariant (_invariant_) and one for variant (_variant_) for each haploid genome (hap{1,2}), with the following columns:
- IDX: Denisovan introgressed tract index with respect to the CHR column.
- CHR: Chromosome.
- START: Start position (inclusive).
- STOP: Stop position (inclusive).
- DEN: Effective sequence length for Altai Denisovan comparisons.
- S: Number of segregating sites.
- QC: QC condition (numeric, e.g., 1 if true, 0 if false).
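A tract file with these columns can be filtered to QC-passing tracts as in the sketch below. Because START and STOP are both inclusive, the physical tract length is STOP - START + 1. The function name, the LENGTH field, and the example file name (with `EXAMPLE` standing in for `{pap_ind}`) are hypothetical, and the rows are made-up values. The sketch also assumes that a QC value of 1 marks a passing tract, following the column description above.

```python
import csv
import gzip
import os
import tempfile


def load_qc_passing_tracts(path):
    """Return QC-passing tracts with an added inclusive-coordinate LENGTH."""
    with gzip.open(path, "rt", newline="") as handle:
        rows = list(csv.DictReader(handle))
    passing = []
    for row in rows:
        if float(row["QC"]) != 1:  # assumed: 1 marks a passing tract
            continue
        start = int(float(row["START"]))
        stop = int(float(row["STOP"]))
        passing.append({
            "IDX": row["IDX"],
            "CHR": row["CHR"],
            "START": start,
            "STOP": stop,
            "LENGTH": stop - start + 1,  # both ends are inclusive
            "DEN": float(row["DEN"]),
            "S": int(float(row["S"])),
        })
    return passing


# Synthetic demonstration with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    tract_path = os.path.join(tmp, "EXAMPLE_hap1_den_intro_variant_tracts.csv.gz")
    with gzip.open(tract_path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["IDX", "CHR", "START", "STOP", "DEN", "S", "QC"])
        writer.writerow([0, 12, 1000, 1999, 900, 5, 1])
        writer.writerow([1, 12, 3000, 3099, 10, 1, 0])
        writer.writerow([2, 12, 5000, 5499, 450, 2, 1])
    demo_tracts = load_qc_passing_tracts(tract_path)

print([tract["LENGTH"] for tract in demo_tracts])
```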
All code, meta information, and final results can be viewed on GitHub. These data are the intermediary files used to produce our final set of results.
