Genomic signatures of domestication in Old World camels

Cite this dataset

Fitak, Robert et al. (2020). Genomic signatures of domestication in Old World camels [Dataset]. Dryad. https://doi.org/10.5061/dryad.prr4xgxj2

Abstract

Domestication begins with the selection of animals showing less fear of humans.  In most domesticates, selection signals for tameness have been superimposed by intensive breeding for economical or other desirable traits. Old World camels, conversely, have maintained high genetic variation and lack these secondary bottlenecks associated with breed development.  By re-sequencing multiple genomes from dromedaries, Bactrian camels, and their endangered wild relatives, we show that positive selection for candidate genes underlying traits collectively referred to as ‘domestication syndrome’ is consistent with neural crest deficiencies and altered thyroid hormone-based signaling.  Comparing our results with other domestic species, we postulate that the core set of domestication genes is considerably smaller than the pan-domestication set – and overlapping genes are likely a result of chance and redundancy.  These results, along with the extensive genomic resources provided, are an important contribution to understanding the evolutionary history of camels and the genomic features of their domestication.

Methods

All methods that either utilize or generate the files in this repository are documented on GitHub (https://github.com/rfitak/Camel_Genomics).

Usage notes

######################################################
### This README file contains a list of the files  ###
### and descriptions for each file in this Dryad   ###
### repository. NOTE: The md5sum is also given for ###
### each file.  You can verify that you do not     ###
### have a corrupted file by matching the md5sum   ###
### for each file given below with the output of   ###
### UNIX terminal commands like "md5 file" or      ###
### "md5sum file". The various file formats are    ###
### described in appendix 1 at the bottom.  The    ###
### computer code and commands used to generate    ###
### the results are available at my GitHub page:   ###
###    https://github.com/rfitak/Camel_Genomics    ###
###      Thanks for promoting data-sharing!!!      ###
###            Bob Fitak 4/28/2020                 ###
######################################################

# Note 1 #
As mentioned in the manuscript associated with this
dataset, all raw sequencing data can be obtained
through the NCBI BioProject database, under accession
PRJNA276064.

# Note 2 #
Files compressed using gzip (suffix .gz) can be uncompressed
in a variety of ways:  by double-clicking, or by typing the
command `gunzip <file.gz>` in the UNIX terminal.
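
For the larger files it may be more convenient to read the
compressed data directly rather than uncompress it to
disk first.  A minimal sketch, assuming the standard gzip
and head utilities; the function name peek_gz is
hypothetical:

```shell
# peek_gz FILE [N]  (hypothetical helper)
# Print the first N lines (default 5) of a gzip-compressed
# file without writing an uncompressed copy to disk.
peek_gz() {
    gzip -cd "$1" | head -n "${2:-5}"
}

# Example:
# peek_gz Final.100kb.csv.gz 3
```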

####################################################
###################### Files #######################
####################################################

FILE:  SraRunTable.csv
MD5SUM: 2c6448024189128e762e353ba1fc7c7f
DESCRIPTION:  This file contains the full list of
sequencing data (raw + aligned to reference) accession
numbers in NCBI's SRA database (https://www.ncbi.nlm.nih.gov/sra).
FORMAT: CSV

FILE:  CB1.fasta.gz
MD5SUM: 57ed0e8fcefb09637a9cea54221f0568
DESCRIPTION:  This file contains the CB1 reference
genome sequence for Camelus ferus (also
available from NCBI).
FORMAT: FASTA (compressed with gzip)

FILE:  ref_CB1_proteins.fa.gz
MD5SUM: e94c1316e57d3a47de4fec9b40fc6a0b
DESCRIPTION:  This file contains all the protein 
sequences annotated in the CB1 reference
genome for Camelus ferus (also available
from NCBI).
FORMAT: FASTA (compressed with gzip)

FILE:  ref_CB1_scaffolds.gff3.gz
MD5SUM: 51e3a17b0bce6e4ee74534eb65ebb93f
DESCRIPTION:  This file contains the annotation
information of the reference genome for Camelus
ferus (CB1; from NCBI).
FORMAT: GFF3 (compressed with gzip)

FILE:  BLAST2GO_CFERUS.tar.gz
MD5SUM: 0c81429b505896e37ab0c66343c906fc
DESCRIPTION:  This file contains all the
output annotations created for the C. ferus
reference (CB1) genome using Blast2GO.
FORMAT: TAR.GZ

FILE:  CB1.repeats.bed
MD5SUM: 2380e59d10ed550707a59b980f8b7b92
DESCRIPTION:  This file contains the annotation
of repetitive elements (masked coordinates) for the
reference Camelus ferus genome (CB1; from NCBI).
FORMAT: BED

FILE:  Drom.repeats.bed
MD5SUM: f63983517d7accbaa462388a4ddd3a3b
DESCRIPTION:  This file contains the annotation
of repetitive elements (masked coordinates) for the
reference dromedary genome (Drom64k; from NCBI).
FORMAT: BED

FILE:  XY.exclude.bed
MD5SUM: e2e45f398017243357f1bbe56cbc7671
DESCRIPTION:  This file contains the annotation
intervals of putative X and Y chromosomal
scaffolds for the reference genome (CB1; from NCBI).
FORMAT: BED

FILE:  All.SNPs.filtered.vcf.gz
MD5SUM: e6c459c8ec6c1e3831b7a55fdd738744
DESCRIPTION:  This file contains all the filtered SNPs
identified among all three camel species aligned to the
Camelus ferus (CB1) reference genome.
FORMAT: VCF v4.1 (compressed with gzip)

FILE:  All.SNPs.filtered.EFF.vcf.gz
MD5SUM: d4a725d29dca4047427b4ab20d7c8fbf
DESCRIPTION:  This file contains all the filtered SNPs
identified among all three camel species aligned to the
Camelus ferus (CB1) reference genome, and annotated
using SNPEFF (same SNPs as All.SNPs.filtered.vcf.gz).
FORMAT: VCF v4.1 (compressed with gzip)

FILE:  Drom.SNPs.filtered.vcf.gz
MD5SUM: ed9251a94e04424046c3015fd0f8e99c
DESCRIPTION:  This file contains all the filtered SNPs
identified among just dromedaries aligned to the
Camelus dromedarius (Cdrom64k) reference genome.
FORMAT: VCF v4.1 (compressed with gzip)

FILE:  *.windowed.pi
DESCRIPTION:  These files contain the nucleotide
diversity across non-overlapping 10-kb windows in
each camel species (see file name).
FORMAT: WINDOWED.PI

FILE:  *.Tajima.D
DESCRIPTION:  These files contain Tajima's D
across non-overlapping 10-kb windows in each camel
species (see file name).
FORMAT: TAJIMA.D

FILE:  *.snpden
DESCRIPTION:  These files contain the SNP density
across non-overlapping 10-kb windows in either
individual camels or across species (see file name).
FORMAT: SNPDEN

FILE:  PSMC.tar.gz
MD5SUM: 8471a445f6d1d2fb6c37012cf4ba3237
DESCRIPTION:  This file contains the output of the
PSMC program written by Heng Li
(https://github.com/lh3/psmc).
FORMAT: TAR.GZ (containing PSMC and PSMCFA files)

FILE:  Final.100kb.csv.gz
MD5SUM: 084936273909cf4e26255a1531ecb8a3
DESCRIPTION:  This file contains the full table of
output from the genomics_general python scripts.
Various measures of genetic variation (pi, theta,
heterozygosity, Fst, etc) are calculated in 100-kb
windows with a step size of 50 kb.
FORMAT: CSV (compressed with gzip)

FILE:  Drom.selected.genes.gff3
MD5SUM: 65c301f5ae958f650dbd7e3120b1297c
DESCRIPTION:  This file contains the annotation
information of the genes overlapping windows
putatively determined to be under selection in
Camelus dromedarius.
FORMAT: GFF3

FILE:  WC.selected.genes.gff3
MD5SUM: 9780ff52c3de8935ca4cc128ec0b1d0b
DESCRIPTION:  This file contains the annotation
information of the genes overlapping windows
putatively determined to be under selection in
Camelus ferus.
FORMAT: GFF3

FILE:  ADMIXTURE.tar.gz
MD5SUM: 0b47fa35f63368a3477ae3bc81597647
DESCRIPTION:  This file contains all the
input and output files used to LD prune
the SNP dataset, generate the principal
components analysis, and determine ancestry
using the program ADMIXTURE. It is recommended
to see the code/methods at:
https://github.com/rfitak/Camel_Genomics/blob/master/admixture.md
FORMAT: TAR.GZ

####################################################
############ Appendix 1:  File Formats #############
####################################################

Format:  CSV
This format stores tabulated data in plain text format.
Each line of this format is a record, with fields, or
columns separated by a comma (",").  Can be opened
in any text editor or spreadsheet software, e.g.
MS Excel, R.

Format:  TSV
This format stores tabulated data in plain text format.
Each line of this format is a record, with fields, or
columns separated by a tab ("\t").  Can be opened
in any text editor or spreadsheet software, e.g.
MS Excel, R.

Format:  VCF v4.1
This is the standard format for representing genomic
variation data.  It is a tab-delimited text file that
also contains a header.  All header lines begin with
a "#".  A complete description is available at:
https://samtools.github.io/hts-specs/VCFv4.1.pdf
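
Because header lines begin with "#", the variant records
and sample names in a gzip-compressed VCF can be summarized
with standard UNIX tools.  This is only a sketch; the
function names are hypothetical, and dedicated tools such
as VCFtools are better suited for real analyses.

```shell
# count_vcf_records FILE.vcf.gz  (hypothetical helper)
# Number of variant records, i.e. lines not starting with "#".
count_vcf_records() {
    gzip -cd "$1" | grep -vc '^#'
}

# list_vcf_samples FILE.vcf.gz  (hypothetical helper)
# Sample IDs are columns 10 and onward of the "#CHROM" line.
list_vcf_samples() {
    gzip -cd "$1" |
        awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Example:
# count_vcf_records Drom.SNPs.filtered.vcf.gz
# list_vcf_samples Drom.SNPs.filtered.vcf.gz
```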

Format:  WINDOWED.PI
This format stores tabulated data in plain text format.
The data are the nucleotide diversity (pi) outputs from
VCFtools v0.1.12b. Each line of this format is a record,
with fields, or columns separated by a tab ("\t").  Can
be opened in any text editor or spreadsheet software, e.g.
MS Excel, R.
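
As an illustration of working with this format, the sketch
below averages the PI column of a *.windowed.pi file.  It
assumes the usual VCFtools column order (CHROM, BIN_START,
BIN_END, N_VARIANTS, PI) with one header line; the function
name mean_pi is hypothetical, and the result is a simple
unweighted mean over the windows listed in the file.

```shell
# mean_pi FILE.windowed.pi  (hypothetical helper)
# Unweighted mean of column 5 (PI), skipping the header line.
mean_pi() {
    awk -F'\t' 'NR > 1 { sum += $5; n++ }
                END { if (n) printf "%.6f\n", sum / n }' "$1"
}

# Example (substitute one of the *.windowed.pi files):
# mean_pi file.windowed.pi
```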

Format:  SNPDEN
This format stores tabulated data in plain text format.
The data are the SNP density output from VCFtools v0.1.12b.
Each line of this format is a record, with fields, or
columns separated by a tab ("\t").  Can be opened
in any text editor or spreadsheet software, e.g.
MS Excel, R.

Format:  TAJIMA.D
This format stores tabulated data in plain text format.
The data are the Tajima's D outputs from
VCFtools v0.1.12b. Each line of this format is a record,
with fields, or columns separated by a tab ("\t").  Can
be opened in any text editor or spreadsheet software, e.g.
MS Excel, R.

Format:  PSMCFA
This is a text-based format specific to the PSMC
program by Heng Li.  See the program website at:
https://github.com/lh3/psmc
Specifically, "the program 'fq2psmcfa' in PSMC
transforms the DNA consensus sequence into a
fasta-like format where the i-th character in the
output sequence indicates whether there is at least
one heterozygote in the bin [100i, 100i+100)."

Format:  PSMC
This is a text-based format specific to the PSMC
program by Heng Li.  See the program website at:
https://github.com/lh3/psmc

Format:  FASTA
This format represents nucleotide or protein sequences
without quality scores.  See
https://en.wikipedia.org/wiki/FASTA_format
for more specific information.

Format:  GFF3
The General Feature Format (GFF) is one of the standard
formats for representing various elements in a genome.
The format is a tab-delimited text file with 9 columns.
See a complete description of the format and
the associated 9 columns here:
https://www.ensembl.org/info/website/upload/gff.html
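
For example, the gene records in one of the *.gff3 files
can be pulled out by filtering on column 3 (feature type).
A minimal sketch, assuming the annotation labels genes with
the type "gene"; the function name gff_genes is
hypothetical.  Note that GFF3 coordinates (columns 4 and 5)
are 1-based and inclusive.

```shell
# gff_genes FILE.gff3  (hypothetical helper)
# Print scaffold, start, end, and attributes for every
# feature whose type (column 3) is "gene".
gff_genes() {
    awk -F'\t' '!/^#/ && $3 == "gene" {
        print $1 "\t" $4 "\t" $5 "\t" $9
    }' "$1"
}

# Example:
# gff_genes Drom.selected.genes.gff3 | wc -l
```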

Format:  BED
BED format is a tab-delimited text file that represents
regions in a genome.  Only three columns are required,
1) chromosome/scaffold, 2) start position, 3) stop position.
Start positions are zero-based and stop positions are
exclusive. See a complete description of the format here:
https://useast.ensembl.org/info/website/upload/bed.html
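
Because BED starts are zero-based and stops are exclusive,
the length of each interval is simply stop minus start.
The sketch below sums those lengths for a BED file; the
function name bed_total_bp is hypothetical, and overlapping
intervals, if any, are counted more than once.

```shell
# bed_total_bp FILE.bed  (hypothetical helper)
# Total number of bases covered by the listed intervals,
# computed as the sum of (stop - start) over all records.
bed_total_bp() {
    awk -F'\t' '!/^(#|track|browser)/ { total += $3 - $2 }
                END { print total + 0 }' "$1"
}

# Example:
# bed_total_bp CB1.repeats.bed
```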

Format:  TAR.GZ
This format represents a folder of files that have
been archived together (.tar) then compressed (.gz).
Most computers can automatically unpack this file into
a folder of files by double-clicking.  On most UNIX
machines, it can easily be unpacked from the command
line using:
tar -zxvf file.tar.gz
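
It can also be useful to list an archive's contents before
unpacking, or to unpack into a separate folder.  A minimal
sketch using standard tar options; the function names are
hypothetical.

```shell
# list_targz FILE.tar.gz  (hypothetical helper)
# List the files in the archive without extracting them.
list_targz() {
    tar -ztf "$1"
}

# extract_targz FILE.tar.gz DIR  (hypothetical helper)
# Unpack the archive into DIR, creating DIR if needed.
extract_targz() {
    mkdir -p "$2" && tar -zxf "$1" -C "$2"
}

# Example:
# list_targz PSMC.tar.gz
# extract_targz PSMC.tar.gz psmc_output
```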

Funding

FWF Austrian Science Fund, Award: P29623-B25

FWF Austrian Science Fund, Award: P24706-B25