A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak

Cite this dataset

Zhong, Jin-cheng et al. (2020). A chromosome-scale reference genome and genome-wide genetic variations elucidate adaptation in yak [Dataset]. Dryad.


The yak is an important livestock species for people living on the harsh, oxygen-deprived Qinghai-Tibetan Plateau and in the Hindu Kush Himalayan mountains. Although a yak genome was sequenced in 2012, that assembly is highly fragmented owing to the limitations of Illumina short-read sequencing. An accurate and complete reference genome is critical for studying the genetic variation of a species. Long reads yield more complete sequences than short reads, and they have been used successfully for high-quality genome assembly in several species. Here, we present a high-quality, chromosome-scale assembly of the yak genome (PB_v1.0), constructed using long-read sequencing assisted by chromatin interaction technology. Compared with the previous yak genome assembly (BosGru_v2.0), the PB_v1.0 assembly substantially improves chromosome sequence continuity, minimizes ambiguity in repetitive structures, and achieves more complete gene models. To characterize yak genetic variation in depth, we generated de novo genome assemblies from Illumina short reads for seven recognized domestic yak breeds from Tibet and Sichuan and for one wild yak from Hoh Xil. By comparing these eight assemblies with the PB_v1.0 genome, we obtained a comprehensive genome-wide map of yak genetic diversity and identified a few protein-coding genes that are absent from the PB_v1.0 assembly. Although the wild yak has undergone a population bottleneck, its genetic diversity is still higher than that of domestic yak. Through whole-genome alignment, we identified breed-specific sequences and genes, which will aid the identification of yak breeds.


High-quality DNA was extracted from the peripheral blood of a female yak in Riwoqe County, Tibet. SMRT sequencing libraries were constructed with a Blood & Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany). A total of 142 SMRT cells generated 184.6 Gbp of subread bases with a mean read length of 9.5 kbp on a PacBio RS II instrument (Pacific Biosciences, Menlo Park, CA, USA). The Falcon (v. 0.5.0) pipeline was used for the initial assembly. In the first step, all overlaps among the raw reads were identified, and read errors were then corrected using the overlap information. In the second step, overlaps among the corrected reads were detected; this step required no consensus calling. In the final step, the string graph assembly was generated and the contig sequences were output in FASTA format.

To improve the quality of the initial assembly, 113.34 Gbp of Illumina short reads were generated from the same individual. Using Pilon (v1.23), 845,002 homozygous insertions, 166,908 deletions, and 2,355,196 substitutions were identified and corrected. DNA from the same individual used for PacBio sequencing was extracted and processed according to BioNano Genomics guidelines. The raw data were assembled with the BioNano Solve (v. 3.1.00) assembly pipeline (BioNano Genomics, San Diego, CA, USA). Combining this assembly with the initial one yielded an improved assembly with a scaffold N50 of 65.67 Mbp and a maximum scaffold length of 128.62 Mbp.

Hi-C libraries were created from yak whole-blood cells; 2–5 million cells were cross-linked and digested with the restriction enzyme HindIII. The sticky ends of all fragments were biotinylated, ligated to each other to form chimeric circles, enriched, sheared, and processed into sequencing libraries in which the individual templates were chimeras of the physically associated DNA molecules from the original cross-linking. Hi-C reads were generated on an Illumina sequencing platform. The paired-end reads were uniquely mapped onto the BioNano assembly, and the scaffolds were classified into 30 groups using 3D-DNA (v. 20180922); the result was taken as the final assembly and is referred to as PB_v1.0. The exact location of each scaffold within the 30 groups was determined from the collinearity between yak and cattle (UMD3.1.1).
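For readers who want to check continuity statistics such as the scaffold N50 reported above against the deposited FASTA files, the following is a minimal Python sketch. The function names `fasta_lengths` and `n50` are illustrative and not part of the dataset or of any pipeline named here, and the sketch assumes the common definition of N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def fasta_lengths(lines):
    """Return the length of each record in FASTA-formatted text lines.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the previous one, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (assumed definition above)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length


# Toy example with made-up sequences (not data from this study).
toy = [">s1", "ACGTACGT", ">s2", "ACGT", "AC", ">s3", "GG"]
print(n50(fasta_lengths(toy)))
```

On a real assembly, the same functions could be applied to an open file handle, for example `n50(fasta_lengths(open(path)))`, where `path` is a hypothetical location of a scaffold FASTA file.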

Seven domestic yak breeds and one wild yak were selected for whole-genome sequencing and assembly. DNA was extracted from ear tissue of the Tibetan breeds, blood of the Sichuan breeds, and skin of the wild yak from Kunlun Spring, Hoh Xil. A whole-genome shotgun strategy was used, with next-generation sequencing (NGS) performed on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). Each genome was sequenced with a combination of short-insert (180 bp and 500 bp) and long-insert (2 kbp and 5 kbp) DNA libraries. SOAPdenovo (v2.04) was used to assemble each genome.


Tibetan Autonomous Region special grants, Award: CARS-37

Key Research and Development Projects in Tibet: Preservation of Characteristic Biological Germplasm Resources and Utilization of Gene Technology in Tibet, Award: ZH20200002

The Second Tibetan Plateau Scientific Expedition and Research Program (STEP), Award: 2019QZKK0501
