Datasets associated with heterozygosity analyses in Zoonomia Consortium's 2020 Nature paper
Data files
Mar 12, 2026 version files 4.62 GB
Abstract
The Zoonomia Project investigates the genomic basis of both shared and specialized traits across eutherian mammals. Here, as a resource for biodiversity conservation, we provide comprehensive files of high-quality variant sites identified in 127 genome assemblies that were described in our publication. The files include only heterozygous positions that passed quality filtering. These catalogs may support the development of cost-effective and accurate genetic assays that remain robust even when DNA quality is low. Such assays are often preferable to designing expensive custom tools, relying on assays developed for related species, or sequencing random regions of the genome. These data were originally shared via an FTP site linked to broad.io/variants. This repository represents a stable, long-term archive.
Dataset DOI: 10.5061/dryad.g79cnp646
Description of the data and file structure
Files and variables
File: genome_info.tsv
Description: information on the files included in the zip archives
Variables
- filename: name of file in zip archive
- subset: zip archive
- species: binomial species name
- common name: common name of species
- Assembly.Name: NCBI Assembly name
- Assembly: NCBI Assembly ID
- BioProject: NCBI BioProject ID
- BioSample: NCBI BioSample ID
- sra: NCBI SRA ID. "NA" indicates there is no SRA entry for this assembly.
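As an illustration of how the columns above might be used, the following minimal sketch indexes genome_info.tsv by filename so that the archive (subset) holding a given variant list can be looked up. It assumes the file is tab-delimited with a header row that uses the variable names listed above; the function name `index_by_filename` and the in-memory rows are hypothetical placeholders, not real records from the dataset.

```python
import csv
import io


def index_by_filename(handle):
    """Map each variant-list filename to its genome_info.tsv row (a dict).

    Assumes a tab-delimited file whose header row contains a 'filename' column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["filename"]: row for row in reader}


# Tiny in-memory stand-in for genome_info.tsv; all values are placeholders.
example = io.StringIO(
    "filename\tsubset\tspecies\tsra\n"
    "example_1.highconf_het_sites.txt.gz\tset1.zip\tGenus species\tNA\n"
)

info = index_by_filename(example)
row = info["example_1.highconf_het_sites.txt.gz"]

# "NA" in the sra column means there is no SRA entry for the assembly.
has_sra = row["sra"] != "NA"
```

With the real file, `open("genome_info.tsv", newline="")` would take the place of the `io.StringIO` stand-in.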
File type: Variant list file
Variables in variant list files
- CHROM: Chromosome (scaffold) identifier from the reference genome
- POS: The 1-based position of the variant on the reference scaffold
- REF: The reference base(s) (e.g., A, C, T, G)
- ALT: Comma-separated list of alternate non-reference alleles.
- GQ: Genotype Quality. Phred-scaled quality score
- DP: Depth. The total number of high-quality reads covering the position.
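The sketch below shows one way a line from a variant list file could be parsed into typed fields. It assumes the six columns appear in the order listed above and are tab-delimited, and that GQ and DP are integers. Whether the files carry a header line is not stated here, so the parser skips a first field of "CHROM" and any line starting with "#" as a precaution. The names `HetSite` and `parse_sites` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class HetSite(NamedTuple):
    chrom: str             # scaffold identifier
    pos: int               # 1-based position on the scaffold
    ref: str               # reference base(s)
    alts: Tuple[str, ...]  # ALT column split on commas
    gq: int                # Phred-scaled genotype quality
    dp: int                # read depth


def parse_sites(lines: Iterable[str]) -> Iterator[HetSite]:
    """Yield one HetSite per data line (assumed column order: CHROM POS REF ALT GQ DP)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or comment-style header
        fields = line.split("\t")
        if fields[0] == "CHROM":
            continue  # plain header row, if present
        chrom, pos, ref, alt, gq, dp = fields[:6]
        yield HetSite(chrom, int(pos), ref, tuple(alt.split(",")), int(gq), int(dp))


# Made-up example lines, not taken from the dataset.
demo_lines = [
    "CHROM\tPOS\tREF\tALT\tGQ\tDP\n",
    "scaffold_1\t1042\tA\tG,T\t99\t31\n",
]
sites = list(parse_sites(demo_lines))
```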
File: set1.zip
Description: Zip file of variant lists for 10 genome assemblies
- Antilocapra_americana_dovetail.highconf_het_sites.txt.gz
- AteGeo_1.highconf_het_sites.txt.gz
- AplRuf_1.highconf_het_sites.txt.gz
- ArtJam_U092.highconf_het_sites.txt.gz
- AnrPal_U0022.highconf_het_sites.txt.gz
- AnoCau_U021.highconf_het_sites.txt.gz
- AcoCah_U019.highconf_het_sites.txt.gz
- BraVar_1.highconf_het_sites.txt.gz
- AloPal_1.highconf_het_sites.txt.gz
- BeaHun_1.highconf_het_sites.txt.gz
File: set2.zip
Description: Zip file of variant lists for 10 genome assemblies
- Catagonus_wagneri_dovetail.highconf_het_sites.txt.gz
- CanFam_VD1.highconf_het_sites.txt.gz
- CarPer_U073.highconf_het_sites.txt.gz
- CebAlb_1.highconf_het_sites.txt.gz
- CapPil_U015.highconf_het_sites.txt.gz
- CheMed_1.highconf_het_sites.txt.gz
- CavTsc_U050.highconf_het_sites.txt.gz
- CasCan_U014.highconf_het_sites.txt.gz
- CerCot_1.highconf_het_sites.txt.gz
- CertNeg_1.highconf_het_sites.txt.gz
File: set3.zip
Description: Zip file of variant lists for 10 genome assemblies
- CriGam_U082.highconf_het_sites.txt.gz
- CroInd_U011.highconf_het_sites.txt.gz
- ChoDid_U054.highconf_het_sites.txt.gz
- DasPun_1.highconf_het_sites.txt.gz
- CteGun_U049.highconf_het_sites.txt.gz
- DauMad_1.highconf_het_sites.txt.gz
- CteSoc_1.highconf_het_sites.txt.gz
- CryFer_1.highconf_het_sites.txt.gz
- CraTho_U097.highconf_het_sites.txt.gz
- CunPac_1.highconf_het_sites.txt.gz
File: set4.zip
Description: Zip file of variant lists for 10 genome assemblies
- Diceros_bicornis_dovetail.highconf_het_sites.txt.gz
- EubJap_1.highconf_het_sites.txt.gz
- EleEdw_1.highconf_het_sites.txt.gz
- DinBra_U070.highconf_het_sites.txt.gz
- DipSte_1.highconf_het_sites.txt.gz
- FelNig_1.highconf_het_sites.txt.gz
- DolPat_U062.highconf_het_sites.txt.gz
- EulFul_1.highconf_het_sites.txt.gz
- EscRob_1.highconf_het_sites.txt.gz
- EryPat_1.highconf_het_sites.txt.gz
File: set5.zip
Description: Zip file of variant lists for 10 genome assemblies
- HelPar_U056.highconf_het_sites.txt.gz
- HipGal_U101.highconf_het_sites.txt.gz
- GliGli_U078.highconf_het_sites.txt.gz
- HysCri_U0031.highconf_het_sites.txt.gz
- HyaHya_1.highconf_het_sites.txt.gz
- GraMur_U067.highconf_het_sites.txt.gz
- HetBruBak_U064.highconf_het_sites.txt.gz
- GalVar_U047.highconf_het_sites.txt.gz
- HydHyd_U065.highconf_het_sites.txt.gz
- Hippopotamus_amphibius_dovetail.highconf_het_sites.txt.gz
File: set6.zip
Description: Zip file of variant lists for 10 genome assemblies
- LemCat_1.highconf_het_sites.txt.gz
- KogBre_1.highconf_het_sites.txt.gz
- ManTri_1.highconf_het_sites.txt.gz
- MacCal_U035.highconf_het_sites.txt.gz
- IndInd_1.highconf_het_sites.txt.gz
- MegLyr_U036.highconf_het_sites.txt.gz
- MacSob_U102.highconf_het_sites.txt.gz
- LepAme_1.highconf_het_sites.txt.gz
- LasBor_U024.highconf_het_sites.txt.gz
- IniGeo_1.highconf_het_sites.txt.gz
File: set7.zip
Description: Zip file of variant lists for 10 genome assemblies
- MerUng_U013.highconf_het_sites.txt.gz
- MizCoq_1.highconf_het_sites.txt.gz
- MonMon_3.highconf_het_sites.txt.gz
- MesBid_U080.highconf_het_sites.txt.gz
- Moschus_moschiferus_dovetail.highconf_het_sites.txt.gz
- MelCap_1.highconf_het_sites.txt.gz
- MorMeg_U096.highconf_het_sites.txt.gz
- MinSch_U075.highconf_het_sites.txt.gz
- MirAng_1.highconf_het_sites.txt.gz
- MicHir_U037.highconf_het_sites.txt.gz
File: set8.zip
Description: Zip file of variant lists for 10 genome assemblies
- MurFea_U074.highconf_het_sites.txt.gz
- MunMunMun_U063.highconf_het_sites.txt.gz
- MusAve_U077.highconf_het_sites.txt.gz
- NasLar_1.highconf_het_sites.txt.gz
- MyoCoy_U051.highconf_het_sites.txt.gz
- NocLep_U093.highconf_het_sites.txt.gz
- HemHyl_1.highconf_het_sites.txt.gz
- MicTal_U118.highconf_het_sites.txt.gz
- MyrTri_1.highconf_het_sites.txt.gz
- MyoMyo_U100.highconf_het_sites.txt.gz
File: set9.zip
Description: Zip file of variant lists for 10 genome assemblies
- NycHum_U0023.highconf_het_sites.txt.gz
- PedCap_U0032.highconf_het_sites.txt.gz
- PanOnc_1.highconf_het_sites.txt.gz
- OnyTor_U017.highconf_het_sites.txt.gz
- OviCan_1.highconf_het_sites.txt.gz
- NycCou_1.highconf_het_sites.txt.gz
- AllBul_U116.highconf_het_sites.txt.gz
- OndZib_U110.highconf_het_sites.txt.gz
- ParHer_1.highconf_het_sites.txt.gz
- OryAfe_1.highconf_het_sites.txt.gz
File: set10.zip
Description: Zip file of variant lists for 10 genome assemblies
- PitPit_1.highconf_het_sites.txt.gz
- PlaMin_1.highconf_het_sites.txt.gz
- CalDon_1.highconf_het_sites.txt.gz
- PipPip_U040.highconf_het_sites.txt.gz
- PteBra_1.highconf_het_sites.txt.gz
- ProCapCap_U060.highconf_het_sites.txt.gz
- PerPac_1.highconf_het_sites.txt.gz
- PetTyp_U052.highconf_het_sites.txt.gz
- PhoPho_1.highconf_het_sites.txt.gz
- PonBla_1.highconf_het_sites.txt.gz
File: set11.zip
Description: Zip file of variant lists for 10 genome assemblies
- RhiFer_U033.highconf_het_sites.txt.gz
- RanTarSib_U061.highconf_het_sites.txt.gz
- ScaAqu_U009.highconf_het_sites.txt.gz
- RouAeg_U006.highconf_het_sites.txt.gz
- SemEnt_1.highconf_het_sites.txt.gz
- PygNem_1.highconf_het_sites.txt.gz
- SaiTat_1.highconf_het_sites.txt.gz
- SigHis_U029.highconf_het_sites.txt.gz
- SolPar_U079.highconf_het_sites.txt.gz
- SagImp_1.highconf_het_sites.txt.gz
File: set12.zip
Description: Zip file of variant lists for 10 genome assemblies
- ThrSwi_U090.highconf_het_sites.txt.gz
- TapInd_1.highconf_het_sites.txt.gz
- TamTet_U055.highconf_het_sites.txt.gz
- TapTer_1.highconf_het_sites.txt.gz
- TonSau_U038.highconf_het_sites.txt.gz
- TolMat_U069.highconf_het_sites.txt.gz
- SpiGra_U072.highconf_het_sites.txt.gz
- TadBra_U034.highconf_het_sites.txt.gz
- SurSur_1.highconf_het_sites.txt.gz
- Tragulus_javanicus_dovetail.highconf_het_sites.txt.gz
File: set13.zip
Description: Zip file of variant lists for 7 genome assemblies
- ZapHud_1.highconf_het_sites.txt.gz
- VulLag_U066.highconf_het_sites.txt.gz
- TupTan_1.highconf_het_sites.txt.gz
- ZalCal_1.highconf_het_sites.txt.gz
- XerIna_U057.highconf_het_sites.txt.gz
- ZipCav_1.highconf_het_sites.txt.gz
- UroGra_U010.highconf_het_sites.txt.gz
Code/software
None. The data are gzipped, tab-delimited text files packaged in zip archives.
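Because each variant list is a gzipped text file stored inside a zip archive, it can be streamed without unpacking anything to disk. The following is a minimal sketch using only the Python standard library; the function name `iter_member_lines` is hypothetical, and the member name passed in must match the path stored in the archive (which may include a folder prefix; `ZipFile.namelist()` shows the stored names). The demonstration builds a small in-memory archive with made-up content rather than reading a real set archive.

```python
import gzip
import io
import zipfile


def iter_member_lines(zip_path, member):
    """Yield text lines from one .txt.gz member of a zip archive.

    zip_path may be a filesystem path or a seekable binary file object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # gzip.open accepts an open binary file object as well as a path.
            with gzip.open(raw, mode="rt", encoding="utf-8") as text:
                for line in text:
                    yield line.rstrip("\n")


# Build a tiny in-memory archive with made-up content for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo.highconf_het_sites.txt.gz",
        gzip.compress(b"scaffold_1\t1042\tA\tG\t99\t31\n"),
    )
buffer.seek(0)

lines = list(iter_member_lines(buffer, "demo.highconf_het_sites.txt.gz"))
```

For a downloaded archive, a path such as "set1.zip" would be passed in place of `buffer`.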
Access information
Other publicly accessible locations of the data:
- n/a
Data was derived from the following sources:
- Genome assemblies publicly available through NCBI without restrictions on use. No license. Assembly identifiers are listed in the accompanying genome_info.tsv file.
We applied the standard GATK pipeline with genotype quality banding to identify the callable fraction of the genome (1,2). First, we used samtools to subsample paired reads from the unmapped .bam files (3). After removing adaptor sequences from the selected reads, we used BWA-MEM to map reads to reference genome scaffolds of >10 kb and removed duplicates using the PicardTools MarkDuplicates utility (4). We then called heterozygous sites using standard GATK HaplotypeCaller specifications, with additional gVCF banding at genotype qualities of 0, 10, 20, 30, 40, 50 and 99. We used the fraction of the genome with genotype quality >15 for subsequent analyses. For the lists of high-confidence variant sites, we include only heterozygous positions that pass filtering at GQ >20, DP >6 and DP <100. Each file includes the contig/scaffold name, the position of the variant, the alleles observed, and the GQ and DP scores.
1. McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
2. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
3. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
4. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
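The high-confidence thresholds stated in the methods text (GQ >20, DP >6 and DP <100) can be written as a simple predicate. The sketch below reads the inequalities as strict, exactly as written. The function name `passes_highconf_filter` is hypothetical, and this is not the code used to produce the files. Since the distributed lists are already filtered, a check like this is mainly useful for verifying a download or for applying stricter cutoffs.

```python
def passes_highconf_filter(gq, dp, min_gq=20, min_dp=6, max_dp=100):
    """Return True when GQ > min_gq and min_dp < DP < max_dp (strict bounds)."""
    return gq > min_gq and min_dp < dp < max_dp


# Made-up (GQ, DP) pairs illustrating the boundaries.
examples = [(99, 31), (20, 31), (99, 6), (99, 100), (21, 7)]
kept = [pair for pair in examples if passes_highconf_filter(*pair)]
```

Raising `min_gq` or narrowing the depth window gives a more conservative subset of sites.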
