Data from: FLARE2: Local ancestry inference with poorly-matched reference panels
Data files
Mar 03, 2026 version; files total 5.86 GB
- chr1to22.mozabite.filtered.postcluster2.anc.vcf.gz (90.60 MB)
- chrXdiploid.mozabite.filtered.postcluster2.anc.vcf.gz (2.28 MB)
- data_for_fig2.tsv (948 B)
- data_for_fig3a.tsv (4.23 KB)
- data_for_fig3b.tsv (1.57 KB)
- data_for_fig4a.tsv (6.81 KB)
- data_for_fig4b.tsv (6.51 KB)
- data_for_fig4c.tsv (4.92 KB)
- hgdp_wgs.20190516.metadata.txt (109.54 KB)
- hgdp.chr1to22.vcf.gz (3.50 GB)
- hgdp.chrX.phased.vcf.gz (98.53 MB)
- hgdp.readme (922 B)
- README.md (4.63 KB)
- sim_gts_rep1_phased.vcf.gz (715.94 MB)
- sim_gts_rep2_phased.vcf.gz (717.65 MB)
- sim_gts_rep3_phased.vcf.gz (731.86 MB)
- sim_sample_map.txt (132.89 KB)
- sim.readme (250 B)
Abstract
The original FLARE method provides computationally efficient and highly accurate local ancestry inference in cases where a closely-matched reference panel is available for each ancestry. In this work, we extend FLARE to incorporate a haplotype clustering algorithm that enables accurate local ancestry inference in scenarios where one or more ancestries do not have a closely-matched reference. This method retains the computational efficiency and accuracy of the original FLARE method while greatly extending its applicability. We apply the new method to data from the Mozabite population from the Human Genome Diversity Project. On the autosomes, we find that the Mozabite samples derive 67% of their ancestry from a population related to European and Middle Eastern populations and the other 33% from a population related to West African populations, with an admixture time of 48 generations ago. In contrast, on the X chromosome, we find that the individuals derive 76% of their ancestry from a population related to European and Middle Eastern populations.
Dataset DOI: 10.5061/dryad.bk3j9kdrk
Description of the data and file structure
This dataset contains data used in the manuscript "Local ancestry inference with poorly-matched reference panels" by SR Browning, SD Temple, and BL Browning (2025/2026). This dataset includes data underlying the figures, phased genotype data, and scripts used to generate the results.
Simulated genotype data were created, local ancestry was called, and MDS plots were made. HGDP Mozabite genotype data were phased, local ancestry was called, and MDS plots were made.
The preprint for the paper is located at: https://www.biorxiv.org/content/10.1101/2025.10.13.681993
Files and variables
File: data_for_fig3a.tsv
Description: Data plotted in Figure 3A of the paper
Variables
- POP/ANC: Reference population, or ancestry (for admixed population)
- MDS1: 1st dim of MDS
- MDS2: 2nd dim of MDS
- MDS3: 3rd dim of MDS
- MDS4: 4th dim of MDS
File: data_for_fig3b.tsv
Description: Data plotted in Figure 3B of the paper
Variables
- POP/ANC: Reference population, or ancestry (for admixed population)
- MDS1: 1st dim of MDS
- MDS2: 2nd dim of MDS
- MDS3: 3rd dim of MDS
- MDS4: 4th dim of MDS
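As a rough illustration of how the two Figure 3 tables could be summarized, the sketch below groups rows by label and averages the four MDS coordinates. It assumes the files are tab-separated with a header row whose column names are exactly POP/ANC, MDS1, MDS2, MDS3, and MDS4; the function name `mds_centroids` and the example rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def mds_centroids(handle, label_col="POP/ANC",
                  dims=("MDS1", "MDS2", "MDS3", "MDS4")):
    """Return {label: tuple of mean MDS coordinates}.

    Assumes a tab-separated table with a header row (an assumption about
    data_for_fig3a.tsv and data_for_fig3b.tsv, not a documented guarantee).
    """
    sums = defaultdict(lambda: [0.0] * len(dims))
    counts = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        label = row[label_col]
        for i, dim in enumerate(dims):
            sums[label][i] += float(row[dim])
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in sums[label])
            for label in sums}


# Made-up rows standing in for the real file contents.
example = io.StringIO(
    "POP/ANC\tMDS1\tMDS2\tMDS3\tMDS4\n"
    "A\t1.0\t0.0\t0.0\t0.0\n"
    "A\t3.0\t2.0\t0.0\t0.0\n"
    "B\t-1.0\t0.5\t0.0\t0.0\n"
)
centroids = mds_centroids(example)
```

For the real data, the same function could be called with `open("data_for_fig3a.tsv", newline="")` in place of the in-memory example.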
File: data_for_fig2.tsv
Description: Data plotted in Figure 2 of the paper
Variables
- Scenario: Scenarios match those in the figure
- Replicate: Three replicates of each scenario
- Clust.FLARE: Accuracy for Clustered FLARE
- Orig.FLARE: Accuracy for Original FLARE
- MOSAIC: Accuracy for MOSAIC
- RFMix: Accuracy for RFMix
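One possible way to summarize this table is to average the three replicates within each scenario for every method. The sketch below assumes the file is tab-separated with a header row matching the variable names listed above and that every accuracy cell is numeric; the function name `mean_accuracy`, the scenario label `S1`, and the accuracy values are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Method columns as listed in the variable descriptions above.
METHODS = ("Clust.FLARE", "Orig.FLARE", "MOSAIC", "RFMix")


def mean_accuracy(handle):
    """Return {scenario: {method: mean accuracy over replicates}}.

    Assumes a tab-separated table with a header row (an assumption about
    data_for_fig2.tsv) and numeric values in all method columns.
    """
    per_scenario = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle, delimiter="\t"):
        for method in METHODS:
            per_scenario[row["Scenario"]][method].append(float(row[method]))
    return {scenario: {m: mean(values) for m, values in by_method.items()}
            for scenario, by_method in per_scenario.items()}


# Made-up values; the real accuracies are in data_for_fig2.tsv.
example = io.StringIO(
    "Scenario\tReplicate\tClust.FLARE\tOrig.FLARE\tMOSAIC\tRFMix\n"
    "S1\t1\t0.5\t0.25\t0.5\t0.5\n"
    "S1\t2\t0.75\t0.25\t0.5\t0.75\n"
    "S1\t3\t1.0\t0.25\t0.5\t1.0\n"
)
summary = mean_accuracy(example)
```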
File: hgdp_wgs.20190516.metadata.txt
Description: Metadata for each individual in the HGDP set. For this study, only the sample and population columns are relevant.
Variables
- sample: Sample ID
- library: Library
- sample_accession: Sample Accession ID
- source: Source of sequence data ("sanger", "sgdp", or "meyer2012")
- library_type: Type of library ("PCR" or "PCRfree")
- population: Population name, e.g. "Mozabite"
- latitude: Latitude (degrees)
- longitude: Longitude (degrees)
- region: Continental region of the population, one of: AFRICA, AMERICA, CENTRAL_SOUTH_ASIA, EAST_ASIA, EUROPE, MIDDLE_EAST, OCEANIA
- sex: F (female) or M (male)
- coverage: Average sequencing coverage
- freemix: Freemix score (estimated level of DNA contamination)
- capmq: Capmq score (maximum allowable Mapping Quality)
- insert_size_average: Average length of sequenced fragments
- array_non_reference_discordance: Non-reference discordance with array data (NA for some individuals)
- library_alias_ENA: Library alias in the European Nucleotide Archive
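Since only the sample and population columns matter for this study, a minimal sketch of pulling out the sample IDs for one population might look like the following. It assumes the metadata file is tab-separated with a header row using the variable names above; the function name `samples_in_population`, the sample IDs, and the population `PopB` in the example are placeholders, not real records.

```python
import csv
import io


def samples_in_population(handle, population):
    """Return the sample IDs whose population column equals `population`.

    Assumes a tab-separated table with a header row containing at least
    the columns "sample" and "population".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["sample"] for row in reader
            if row["population"] == population]


# Placeholder rows; real IDs are in hgdp_wgs.20190516.metadata.txt.
example = io.StringIO(
    "sample\tpopulation\tsex\n"
    "ID001\tMozabite\tF\n"
    "ID002\tPopB\tM\n"
    "ID003\tMozabite\tM\n"
)
mozabite_ids = samples_in_population(example, "Mozabite")
```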
File: hgdp.readme
Description: Information about the phased HGDP data
File: hgdp.chrX.phased.vcf.gz
Description: Filtered and phased HGDP genotype data for Chr X
File: hgdp.chr1to22.vcf.gz
Description: Filtered and phased HGDP genotype data for Chr 1-22
File: data_for_fig4a.tsv
Description: Data for Figure 4A of the paper. The file has no header row. The first column is the population/ancestry label; the final four columns are the four MDS dimensions.
File: data_for_fig4b.tsv
Description: Data for Figure 4B, same format as for Figure 4A
File: data_for_fig4c.tsv
Description: Data for Figure 4C, same format as for Figure 4A
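Because the Figure 4 files have no header row, column positions have to be used instead of column names. The sketch below takes the first field as the label and the last four fields as the MDS coordinates, following the description of data_for_fig4a.tsv. It assumes the fields are tab-separated; the function name `read_headerless_mds` and the example rows are hypothetical.

```python
import csv
import io


def read_headerless_mds(handle):
    """Return a list of (label, (mds1, mds2, mds3, mds4)) tuples.

    The first field is taken as the population/ancestry label and the
    final four fields as MDS dimensions. Tab separation is an assumption.
    """
    records = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        coords = tuple(float(value) for value in fields[-4:])
        records.append((fields[0], coords))
    return records


# Made-up rows standing in for a Figure 4 file.
example = io.StringIO(
    "PopA\t0.5\t-0.25\t0.0\t1.0\n"
    "anc1\t-1.5\t2.0\t0.25\t0.0\n"
)
records = read_headerless_mds(example)
```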
File: sim_sample_map.txt
Description: For each simulated individual, the first column gives the individual ID (IDs begin with "tsk") and the second column gives the corresponding population identifier. See the sim.readme file for information about the populations.
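A two-column sample map like this is often used to split samples into groups. The following sketch collects IDs by population identifier. It assumes the columns are separated by whitespace (tab or space) and that there is no header row; the function name `load_sample_map`, the IDs, and the population identifiers in the example are invented, so consult sim.readme for the real ones.

```python
import io
from collections import defaultdict


def load_sample_map(handle):
    """Return {population identifier: [sample IDs]} from a two-column map.

    Column 1 is the sample ID and column 2 the population identifier.
    Whitespace separation and the absence of a header are assumptions.
    """
    by_population = defaultdict(list)
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        by_population[parts[1]].append(parts[0])
    return dict(by_population)


# Invented IDs and population identifiers for illustration only.
example = io.StringIO(
    "tsk_0\tpopA\n"
    "tsk_1\tpopA\n"
    "tsk_2\tpopB\n"
)
sample_map = load_sample_map(example)
```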
File: sim_gts_rep3_phased.vcf.gz
Description: Phased simulated data, 3rd replicate
File: sim.readme
Description: Information about the simulated data
File: chrXdiploid.mozabite.filtered.postcluster2.anc.vcf.gz
Description: Local ancestry calls for the HGDP Mozabite, Chr X
File: chr1to22.mozabite.filtered.postcluster2.anc.vcf.gz
Description: Local ancestry calls for the HGDP Mozabite, Chr 1-22
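For readers who want a quick look at the local ancestry calls, the sketch below tallies the fraction of haplotype-by-marker calls assigned to each ancestry. It assumes the files follow the FLARE output convention of AN1 and AN2 FORMAT subfields holding an ancestry index for each haplotype, plus an `##ANCESTRY=<name=index,...>` header line; please check the VCF header before relying on this. The function name and the example records are hypothetical. Note that this is an unweighted per-marker count, so it is not expected to reproduce the ancestry proportions reported in the paper.

```python
import io
from collections import Counter


def ancestry_marker_fractions(handle):
    """Return {ancestry: fraction of haplotype-by-marker calls}.

    Assumes AN1/AN2 FORMAT subfields and an ##ANCESTRY=<...> header line
    (assumed FLARE output layout; verify against the actual file).
    """
    names = {}
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##ANCESTRY=<"):
            body = line[len("##ANCESTRY=<"):].rstrip(">")
            for item in body.split(","):
                name, index = item.split("=")
                names[index] = name
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            keys = fields[8].split(":")          # FORMAT column
            i1, i2 = keys.index("AN1"), keys.index("AN2")
            for sample in fields[9:]:
                values = sample.split(":")
                counts[values[i1]] += 1
                counts[values[i2]] += 1
    total = sum(counts.values())
    return {names.get(index, index): n / total for index, n in counts.items()}


# Tiny made-up VCF with two markers and two samples.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "##ANCESTRY=<anc0=0,anc1=1>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n"
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT:AN1:AN2\t0|1:0:1\t0|0:0:0\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:AN1:AN2\t1|1:0:0\t0|1:1:0\n"
)
fractions = ancestry_marker_fractions(example)
```

To read one of the compressed files directly, the handle could be opened with `gzip.open(path, "rt")` from the standard library.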
File: sim_gts_rep2_phased.vcf.gz
Description: Phased simulated data, 2nd replicate
File: sim_gts_rep1_phased.vcf.gz
Description: Phased simulated data, 1st replicate
Code/software
https://github.com/browning-lab/flare
Access information
Data were derived from the following sources:
- HGDP data were obtained from another source; see the HGDP readme file.
