This README.txt file was generated on 2022-10-06 by David Tian

GENERAL INFORMATION

1. Title of Dataset: Data from Severe inbreeding, increased mutation load, and gene loss-of-function in the critically endangered Devil’s Hole pupfish

2. Author Information
    A. Investigator 1
        Name: David Tian
        Institution: University of California, Berkeley
        Email: davidtian@berkeley.edu
    B. Investigator 2
        Name: Austin H Patton
        Institution: University of California, Berkeley
        Email: austinhpatton@berkeley.edu
    C. Investigator 3
        Name: Bruce Turner
        Institution: Virginia Tech
        Email: fishgen@vt.edu
    D. Investigator 4
        Name: Christopher H Martin
        Institution: University of California, Berkeley
        Email: chmartin@berkeley.edu

3. Date of data collection: 1937-2012

4. Geographic location of data collection: American Southwest, Death Valley National Park, Ash Meadows National Wildlife Refuge, Rio Yaqui Basin, Mexico

5. Information about funding sources that supported the collection of the data: This work was funded by the U.S. Fish and Wildlife Service, National Park Service, National Science Foundation DEB CAREER grant #1749764, National Institutes of Health grant 5R01DE027052-02, and the University of California, Berkeley to CHM. DT was supported by a National Science Foundation Graduate Research Fellowship (DGE 1752814).

SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data:
2. Links to publications that cite or use the data:
3. Links to other publicly accessible locations of the data: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA887195
4. Links/relationships to ancillary data sets:
5. Was data derived from another source? No.
6. Recommended citation for this dataset:

DATA & FILE OVERVIEW

1. Description of the dataset: These data were generated to assess inbreeding and mutation load in Death Valley Cyprinodon pupfishes.

File 1 Name: ROHs.zip
File 1 Description: Runs of homozygosity (ROHs) for each individual in BED file format.
File 2 Name: Deletions.zip
File 2 Description: Deletions that are unique to a given Cyprinodon diabolis individual and not found in any Cyprinodon nevadensis or Cyprinodon salinus individuals.
File 3 Name: final.allgoodDHP.vs.final.nev.weir.fst
File 3 Description: Output of vcftools fst.
File 4 Name: Admixture.zip
File 4 Description: Contains ADMIXTURE analysis output for K values (number of distinct groups) 1-20.
File 5 Name: PCA.zip
File 5 Description: Output files for PCA analyses of genetic variation based on genome-wide SNPs.
File 6 Name: DHP_census_count.csv
File 6 Description: Contains spring and fall counts of Cyprinodon diabolis census population size from 1972 to 2019.
File 7 Name: SYN.NSYN.LOF.geno.pro.xlsx
File 7 Description: Genotype proportions (homozygous ancestral, heterozygous, homozygous derived) across variant classes (synonymous, non-synonymous, loss-of-function).
File 8 Name: pupfish.vcf.gz
File 8 Description: VCF file of genotype data for all samples.

DATA-SPECIFIC INFORMATION FOR: ROHs.zip

This folder contains:
    POR1_ROHS.bed
    POR3_ROHS.bed
    DHP54907_ROHS.bed
    CNevAma_ROHS.bed
    CNEVNEV_SARATOGASPRING_8_22_ROHS.bed
    CNEVAMA_CHINA_4_3_ROHS.bed
    CNEVAMA_VALLEYSPRING_7_19_ROHS.bed
    CNEVAMA_TECOPABORE_2_23_ROHS.bed
    CNEVAMA_TECOPA_3_16_ROHs.bed
    CNEVPEC_INDIANSPRING_ROHS.bed
    CSALSAL1_ROHS.bed
    CSALMIL2_ROHS.bed
    CNEVMIO_BIGSPRING_ROHS.bed
    CNEVSHO_1_23_ROHS.bed
    DHP54920_N2_ROHS.bed
    DHP54918_ROHS.bed
    CNEVSHO_HSPRING_6_38_ROHS.bed
    CSALMIL1_ROHS.bed
    DHP54917_ROHS.bed
    DHP54913_ROHS.bed
    CNEVPEC_SCHOOLSPRING_12_3_ROHS.bed
    DHP54903_ROHS.bed
    DHP1980.5_ROHS.bed
    DIAB54919_ROHS.bed
    CSALSAL2_ROHS.bed

1. Number of variables: 3
2. Number of cases/rows: variable, depending on individual file
3. Variable List:
    Scaffold: Scaffold where ROH is located
    Beginning genomic coordinate: Basepair where ROH begins
    Ending genomic coordinate: Basepair where ROH ends
4. Missing data codes: None
5. Abbreviations used: N/A; not applicable
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: Deletions.zip

This folder contains:
    DIAB54919_DEL_nevadensis_salinus.xlsx
    DHP1980-5_DUP_nevadensis_salinus.xlsx
    DHP54917_DEL_nevadensis_salinus.xlsx
    DHP54913_DEL_nevadensis_salinus.xlsx
    DHP54903_DEL_nevadensis_salinus.xlsx

1. Number of variables: 6
2. Number of cases/rows: variable, depending on individual file
3. Variable List:
    chr: Scaffold on which the deletion begins
    start: Basepair where deletion begins
    chr2: Scaffold on which the deletion ends
    end: Basepair where deletion ends
    id: ID for deletion
    size: Size of deletion
4. Missing data codes: None
5. Abbreviations used: N/A; not applicable
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: final.allgoodDHP.vs.final.nev.weir.fst

1. Number of variables: 3
2. Number of cases/rows: 6129052
3. Variable List:
    CHROM: Scaffold
    POS: Basepair position
    WEIR_AND_COCKERHAM_FST: Fst value for a given SNP when comparing Cyprinodon diabolis and Cyprinodon nevadensis populations.
4. Missing data codes: None
5. Abbreviations used: N/A; not applicable
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: Admixture.zip

This folder contains:
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.10.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.1.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.2.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.3.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.4.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.5.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.6.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.7.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.8.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.9.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.11.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.12.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.13.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.14.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.15.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.16.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.17.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.18.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.19.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.20.Q
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.bed
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.1.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.2.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.3.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.4.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.5.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.6.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.7.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.8.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.9.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.10.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.11.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.12.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.13.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.14.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.15.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.16.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.17.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.18.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.19.P
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.thin100k.final.20.P

For files that end in .P, using the .6.P file as an example:

1. Number of variables: 6
2. Number of cases/rows: 100000
3. Variable List: One column for each of the preset populations (so if the file ends with .6.P, there are six columns for the six populations), giving the inferred allele frequency in each ancestral population for the 100000 SNPs input from the empirical genetic dataset (1 row per SNP). Allele frequencies are presented as a percentage.
4. Missing data codes: No missing data
5. Specialized formats or other abbreviations used:
6. Other relevant information:

Files that end in .Q contain the ancestry fractions estimated for each individual given the K population number parameter used (this is the number preceding .Q in the filename). Using the .6.Q file as an example:

1. Number of variables: 6
2. Number of cases/rows: 30
3. Variable List: Each column represents the ancestry of each individual assigned to one of the K populations. For example, if the file ends in .6.Q, there are six columns per row, which represent an individual's percentage of ancestry coming from each population. Ancestry proportions are presented as a percentage.
4. Missing data codes: No missing data
5. Specialized formats or other abbreviations used:
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: PCA.zip

This folder contains:
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.final.eigenval
    ADP_FilteredSNPS.passing.SNPableMask.tmpID.reheader.minDP3.maxDP315.minGQ20.genorate0.5.maf0.05.prune.50.5.0.5.keep.final.eigenvec

For files that end in .eigenval:

1. Number of variables: 1
2. Number of cases/rows: 20
3. Variable List:
    Col 1: Percentage of variation in genetic dataset explained by each individual principal component (1 per row, starting with PC 1)
4. Missing data codes: No missing data
5. Specialized formats or other abbreviations used:
6. Other relevant information:

For files that end in .eigenvec:

1. Number of variables: 22
2. Number of cases/rows: 30
3. Variable List:
    Col 1-2: Individual ID
    Col 3-22: Position along principal component axis (PC1-PC20), units: relative raw component scores
4. Missing data codes: No missing data
5. Specialized formats or other abbreviations used:
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: DHP_census_count.csv

1. Number of variables: 3
2. Number of cases/rows: 96
3. Variable List:
    Year:
    Season: fall or spring
    Count: number of Cyprinodon diabolis counted
4. Missing data codes: NA for missing census count
5. Specialized formats or other abbreviations used:
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: SYN.NSYN.LOF.geno.pro.xlsx

1. Number of variables: 9
2. Number of cases/rows: 360
3. Variable List:
    INDV: individual sample
    Species: species that individual belongs to
    Coverage: coverage of sample, mapped to reference genome
    Missing_data: amount of missing data for given individual
    variant: class of variant identified by snpeff (synonymous, non-synonymous, loss-of-function, gained stop codon)
    genotype: genotype of variant (homozygous ancestral, heterozygous, homozygous derived)
    count: number of variants of a given variant class and genotype
    total_count: the total number of variants identified by snpeff as part of a given variant class across the three genotypes
    proportion: count / total_count
4. Missing data codes: No missing data
5. Specialized formats or other abbreviations used:
    SYN = synonymous
    NSYN = non-synonymous
    LOF = loss-of-function
    Hom_Anc = homozygous ancestral
    Het_Der = heterozygous
    Hom_der = homozygous derived
    stop_gained = gained stop codon
6. Other relevant information:

DATA-SPECIFIC INFORMATION FOR: pupfish.vcf.gz

1. Number of variables: 9
2. Number of cases/rows: 6295414
3. Variable List:
    CHROM: Scaffold where SNP is located
    POS: Reference position of SNP
    ID: Identifier of SNP
    REF: Reference base
    ALT: Alternate base
    QUAL: Phred-scaled quality score of ALT allele
    FILTER: Filter status - PASS if position has passed all filters.
    INFO: Additional information regarding variant
    Please refer to the Variant Call Format Version 4.2 Specification for more details: https://samtools.github.io/hts-specs/VCFv4.2.pdf
4. Missing data codes: ./. represents missing genotype
5. Specialized formats or other abbreviations used: Please refer to the Variant Call Format Version 4.2 Specification for details on abbreviations: https://samtools.github.io/hts-specs/VCFv4.2.pdf
6. Other relevant information:
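
As an illustration of the ./. missing-genotype code described above for pupfish.vcf.gz, the following is a minimal Python sketch that counts missing genotype calls per sample. It assumes the standard VCF 4.2 layout (nine fixed columns followed by one column per sample, with GT as the first FORMAT subfield). The function names and the two-sample example records (S1, S2) are hypothetical and are not part of the dataset.

```python
import gzip


def count_missing_genotypes(lines):
    """Count missing genotype calls per sample from an iterable of VCF text lines."""
    samples = []
    missing = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the nine fixed columns
            missing = {sample: 0 for sample in samples}
            continue
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[0]  # GT comes first when present
            if genotype in ("./.", ".|.", "."):
                missing[sample] += 1
    return missing


def count_missing_in_file(path):
    """Convenience wrapper for a gzip-compressed VCF such as pupfish.vcf.gz."""
    with gzip.open(path, "rt") as handle:
        return count_missing_genotypes(handle)


# Hypothetical in-memory example with two samples and two variant records.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "scaf1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t./.",
    "scaf1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t./.:0\t./.:0",
]
print(count_missing_genotypes(example))  # {'S1': 1, 'S2': 2}
```

For the real file, count_missing_in_file("pupfish.vcf.gz") would stream the records line by line, so the whole file would not need to fit in memory.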