Four new genomes of the Pallas´s cat: an insight into the patterns of within-species variability

Cite this dataset

Plášil, Martin; Bubeníková, Jana; Burger, Pamela; Hořín, Petr (2024). Four new genomes of the Pallas´s cat: an insight into the patterns of within-species variability [Dataset]. Dryad. https://doi.org/10.5061/dryad.pzgmsbcvt

Abstract

Manul (Otocolobus manul) is the only representative of the genus Otocolobus, which belongs to the Leopard Cat lineage. Its habitat is characterized by harsh environmental conditions. Although manul populations are probably more stable than previously thought, their size is still declining. The main cause of the decline is the destruction of natural habitat, which, together with the species' natural behavior, results in geographically fragmented populations and a potential loss of genetic variability. Conservation programs exist to protect manuls, but those based on captive breeding are often unsuccessful because of the animals' increased susceptibility to diseases. The manul is therefore a suitable model species for evolutionary and diversity studies as well as for studying mechanisms of adaptation to harsh environments and mechanisms of susceptibility to diseases. Whole genome sequencing (WGS) is an important tool for such studies, providing a base-by-base view of the genome. Recently, a genome of Otocolobus manul based on nanopore long-read sequencing has been published. Using whole genome resequencing on the Illumina platform, we obtained information on the genomes of four other manuls, aiming to better understand inter- and intraspecific variation of the species. The parameters characterizing the quality of sequencing were within the standard range, and all four genomes analyzed were similar in most characteristics. On average, we detected a total of 3,668,327 polymorphic variants. Information on different types of structural variants not available from the reference genome was retrieved. The average whole-genome heterozygosity detected was almost identical to that found in the Otocolobus manul reference genome. In this context, we performed a more detailed analysis of the candidate gene EPAS1, potentially related to adaptation to the hypoxic environment. This analysis revealed both inter- and intraspecific variation, confirmed the presence of a previously described non-synonymous substitution in exon 15 unique to manuls, and identified three additional unique non-synonymous substitutions located in EPAS1 exonic sequences that had not been analyzed so far.

README: Four new genomes of the Pallas´s cat: an insight into the patterns of within-species variability

https://doi.org/10.5061/dryad.pzgmsbcvt

This dataset contains results of genome analyses performed on 4 individual genomes of Otocolobus manul. The files contained in this dataset are as provided by the sequencing facility (Novogene). This dataset serves as a full primary data repository for a peer-reviewed manuscript with the same name.

Description of the data and file structure

Report

  • X204SC23111270-Z01-F002_Otocolobus_manul_Report: contains a graphical overview of all collected statistical data in html format.
  • src: structured elements used in the html report above (images, fonts, methods, etc.)

Results

Contains subfolders of analysis results as presented in the graphical report above. Each subfolder contains a readme file with further details.

  • 01.OriginalData: examples and explanation of input sequencing data

  • 02.QualityControl: quality control metrics for each individual using Fastp software

    • CleanData_QCsummary: tabular representation of basic sequencing data (Raw bases (bp), Clean bases (bp), Error rate (%), Q20 (%), Q30 (%), GC content (%)) for each sequenced animal
    • ErrorRate: graphical representation of sequencing error rate for every sequencing run for each sequenced animal both in pdf and png formats
    • Quality distribution: graphical representation of quality distribution as the sequencing runs progressed, pdf and png formats
    • ReadsClassification: graphical representation of different classes of reads (Clean reads, Adapter related, Containing N, Low quality) in each sequencing run, pdf and png formats
  • 03.Mapping: reference genome and mapping statistics, tools used - BWA, SAMtools, Picard

    • MapStat: tabular and graphical representation of various coverage statistics (average sequencing depth, 1X coverage (%), 4X coverage (%), coverage of individual reference chromosomes), xls, pdf and png formats
    • Reference
      • genome.fa: reference manul genome used for mapping
      • genes.gfff3: annotation record for reference genome
      • fa.stat: statistics of reference genome
      • genome.fa.fai: index file for reference genome
  • 04.SNP_VarDetect: SNP variation files (GATK) and SNP statistics and annotation, tools used - GATK and ANNOVAR

    • picture_in_reports: contains graphical representations of variation statistics used in the report, pdf and png formats
    • raw_variants.snp.avinput.exonic_variant_function: bgzip-compressed annotation results produced by ANNOVAR. The first, second and third columns give the variant line number in the input file, the variant effect on the coding sequence and the gene/transcript affected; the remaining columns are reproduced from the input file
    • raw_variants.snp.avinput.variant_function: bgzip-compressed annotation results produced by ANNOVAR. The second and third columns give the variant effect on gene structure and the genes affected; the remaining columns are reproduced from the vcf file
    • raw_variants.snp.vcf: variants in bgzip-compressed VCF format
    • SNP.frequency.xls: frequency with which substitutions between different bases occur
    • SNP_Annotation_Statistics.xls: statistics of SNP detection and annotation
  • 05.InDel_VarDetect: InDel variation files and InDel statistics and annotation, tools used - GATK and ANNOVAR

    • picture_in_reports: contains graphical representations of variation statistics used in the report, pdf and png formats
    • raw_variants.indel.avinput.exonic_variant_function: bgzip-compressed annotation results produced by ANNOVAR. The first, second and third columns give the variant line number in the input file, the variant effect on the coding sequence and the gene/transcript affected; the remaining columns are reproduced from the input file
    • raw_variants.indel.avinput.variant_function: bgzip-compressed annotation results produced by ANNOVAR. The second and third columns give the variant effect on gene structure and the genes affected; the remaining columns are reproduced from the vcf file
    • raw_variants.indel.vcf: indel variants in bgzip-compressed VCF format
    • InDel.CDSpercentage.xls: Length Distribution of CDS-located InDels
    • InDel.GENOMEpercentage.xls: Length Distribution of GENOME-located InDels
    • InDel_Annotation_Statistics.xls: Statistics of indel distribution based on the indel classification
  • 06.SV_VarDetect: Structural variants variation files and SV statistics and annotation, tools used - Breakdancer and ANNOVAR

    • Manul_x: each individual has a folder containing:
      • Manul_x.filted.ctx.gz: bgzip-compressed results file of CTX variation
      • Manul_x.sv.avinput.variant_function.gz: bgzip-compressed result of SV annotations
    • picture_in_reports: contains graphical representations of SV length distribution, pdf and png formats
    • SV.len.rate.xls: Length Distribution of SVs
    • SV_Annotation_Statistics.xls: contains Statistics of SV detection and annotation
  • 07.CNV_VarDetect: Copy number variants variation files and CNV statistics and annotation, tools used - CNVnator and ANNOVAR

    • Manul_x: each individual has a folder containing:
      • Manul_x.cnv.vcf.gz: The CNV variations detection result in bgzip-compressed format
      • Manul_x.cnv.avinput.variant_function.gz: CNV annotation results in bgzip-compressed format
    • picture_in_reports: contains graphical representation of CNV Annotation Statistics, pdf and png formats
    • CNV_Annotation_Statistics.xls: Statistics of CNV detection and annotation
  • 08.VarDetect_Visualisation: Visualisation of different detected variant types across studied genomes, each individual has separate files indicated by the ID, tools used - Circos

    • Manul_x.Circos: graphical representation of whole genome variations distribution, pdf and png formats
    • Manul_x.snpDensity: graphical representation of SNP density of whole genome, pdf and png formats
    • Manul_x.indeDensity: graphical representation of indel density of whole genome, pdf and png formats
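
The coverage metrics listed for MapStat under 03.Mapping (average sequencing depth, 1X coverage, 4X coverage) can in principle be recomputed from per-base depths. The sketch below is only an illustration: it assumes a plain sequence of integer depths, one per reference position, and the function name and input format are hypothetical rather than part of this dataset. Whether the provider's figures exclude gaps in the reference is not stated here, so small differences from the reported values would not be surprising.

```python
from typing import Dict, Sequence


def coverage_summary(depths: Sequence[int]) -> Dict[str, float]:
    """Summarise per-base depths into MapStat-style metrics (illustrative only)."""
    n = len(depths)
    if n == 0:
        raise ValueError("no positions supplied")
    return {
        # Mean number of reads covering a reference position.
        "average_depth": sum(depths) / n,
        # Percentage of positions covered by at least 1 read.
        "coverage_1x_pct": 100.0 * sum(d >= 1 for d in depths) / n,
        # Percentage of positions covered by at least 4 reads.
        "coverage_4x_pct": 100.0 * sum(d >= 4 for d in depths) / n,
    }


# Toy input: five positions with made-up depths.
example = coverage_summary([0, 1, 3, 4, 12])
```

For the toy input, the average depth is 4.0, with 80% of positions covered at 1X and 40% at 4X.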

Sharing/Access information

Raw genomic data were uploaded to the SRA and are available under BioProject number PRJNA1098449.

Methods

Tissues of four different manul individuals (one female and three males) obtained from Czech zoos (ZOO Jihlava, ZOO Brno, ZOO Prague) were selected based on available information about their origin and relatedness. The samples were obtained either post-mortem or as part of veterinary procedures performed for other reasons.

DNA was extracted from available tissue samples (either blood, spleen or colon) using the MagAttract HMW DNA isolation kit (Qiagen, Germany). The kit was used according to the manufacturer’s recommendations. Two isolations were made for each individual. DNA samples were evaluated in terms of purity (absorbance) and concentration using an Infinite 200 Pro plate reader (Tecan, Switzerland). DNA samples were stored at 4°C for 5 days and then transported on dry ice to the Novogene sequencing facility in Germany. Samples were checked prior to library construction using an Agilent 5400 fragment analyzer. All samples passed QC (quantity ≥ 200 ng; OD260/280 = 1.8-2.0; no degradation).

Sequencing was performed using the Illumina NovaSeq X Plus platform as a service provided by Novogene (China).

A total of 0.2 μg DNA per sample was used as input material for the DNA library preparations. The genomic DNA sample was randomly fragmented by sonication to a size of 350 bp. The DNA fragments were then end-polished, A-tailed, and ligated with the full-length adapter for Illumina sequencing. The fragments with adapters were size selected, PCR amplified, and purified with the AMPure XP system (Beverly, USA). Subsequently, library quality was assessed on the Agilent 5400 system (Agilent, USA) and quantified by qPCR (1.5 nM). The qualified libraries were pooled and sequenced on the Illumina NovaSeq X Plus platform with PE150 chemistry.

The distribution of sequencing quality along reads and the sequencing error rate were evaluated, and low-quality reads and adaptors were filtered out using Fastp (v.0.20.0) with parameters -g -q 5 -u 50 -n 15 -l 150.
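
To make the effect of these thresholds concrete, the following is a simplified restatement of the per-read criteria implied by -q 5 -u 50 -n 15 -l 150, not Fastp's own code. It assumes Phred scores are already decoded to integers and treats a base as unqualified when its score is below the -q value; the function name is hypothetical. Adapter removal and the polyG trimming enabled by -g are not modelled.

```python
from typing import Sequence


def passes_read_filter(
    seq: str,
    quals: Sequence[int],
    qualified_phred: int = 5,             # -q 5
    unqualified_pct_limit: float = 50.0,  # -u 50
    n_base_limit: int = 15,               # -n 15
    length_required: int = 150,           # -l 150
) -> bool:
    """Return True if a single read would be kept under the stated thresholds."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    length = len(seq)
    # Reads shorter than the required length are discarded.
    if length < length_required:
        return False
    # Discard reads in which too large a share of bases is below the -q score.
    low_quality = sum(q < qualified_phred for q in quals)
    if low_quality > unqualified_pct_limit * length / 100.0:
        return False
    # Discard reads with more N bases than allowed.
    if seq.upper().count("N") > n_base_limit:
        return False
    return True


# Toy reads with made-up qualities.
kept = passes_read_filter("A" * 150, [30] * 150)
dropped = passes_read_filter("A" * 149, [30] * 149)
```

Because -l 150 equals the PE150 read length, any read shortened by trimming would fail the length criterion under this reading of the parameters.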

The OtoMan_p1.0 genome (GCA_028564725.2) was used as the reference. The filtered sequencing data were mapped to the reference sequence using BWA (Li et al., 2009a) (parameters: mem -t 4 -k 32 -M). The resulting alignments were sorted using SAMtools (v1.13) with parameters sort -@ 6 -m 2G and merged for each sample using Picard (v1.111).
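
As a rough illustration of how the stated mapping and sorting parameters fit together, the sketch below assembles the two command lines without running them. The file names are placeholders, and piping BWA output into SAMtools with an explicit -o output file is an assumption about the workflow, not something documented by the sequencing facility. The Picard merge step is omitted.

```python
import shlex
from typing import List, Tuple


def mapping_commands(
    reference: str, reads_1: str, reads_2: str, sorted_bam: str
) -> Tuple[List[str], List[str]]:
    """Build argument lists for the mapping and sorting steps (nothing is executed)."""
    # Parameters as stated in the Methods: mem -t 4 -k 32 -M
    bwa = ["bwa", "mem", "-t", "4", "-k", "32", "-M", reference, reads_1, reads_2]
    # Parameters as stated in the Methods: sort -@ 6 -m 2G
    # The trailing "-" makes samtools read alignments from standard input (assumed piping).
    sort = ["samtools", "sort", "-@", "6", "-m", "2G", "-o", sorted_bam, "-"]
    return bwa, sort


# Placeholder file names, for illustration only.
bwa_cmd, sort_cmd = mapping_commands(
    "genome.fa", "Manul_x_1.fq.gz", "Manul_x_2.fq.gz", "Manul_x.sorted.bam"
)
pipeline = " | ".join(shlex.join(cmd) for cmd in (bwa_cmd, sort_cmd))
```

The resulting `pipeline` string can be inspected or logged before deciding how to run it.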

Single nucleotide polymorphisms (SNPs) and indels were called for the entire cohort (joint calling) using HaplotypeCaller from GATK (v4.0.5.1) (DePristo et al., 2011) with the following parameters: --pair-hmm-gap-continuation-penalty 10 -ERC GVCF --genotyping-mode DISCOVERY -stand-call-conf 30. The polymorphisms detected were annotated using ANNOVAR (v2015Dec14) (Wang et al., 2010) and their characteristics (e.g. quality, total numbers and distribution in different genomic regions) were evaluated. The original annotation record for the reference genome was kindly provided by the authors (Flack et al. 2023).
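
One way to summarize the ANNOVAR output is to tally variant effects from an exonic_variant_function file. The sketch below assumes the tab-separated layout described for these files in this README (line number, effect on the coding sequence, gene/transcript, then columns copied from the input) and relies on bgzip output being readable as ordinary gzip. The function names and demo rows are invented for illustration and are not taken from the dataset.

```python
import gzip
from collections import Counter
from typing import Iterable


def tally_exonic_effects(lines: Iterable[str]) -> Counter:
    """Count variants per coding-sequence effect (second tab-separated column)."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts


def tally_file(path: str) -> Counter:
    """Tally effects directly from a bgzip/gzip-compressed annotation file."""
    with gzip.open(path, "rt") as handle:
        return tally_exonic_effects(handle)


# Made-up demo rows in the assumed layout (placeholder gene and transcript names).
demo_rows = [
    "line3\tnonsynonymous SNV\tGENE1:TX1:exon2:c.A10G:p.T4A,\tchr1\t100\t100\tA\tG\n",
    "line8\tsynonymous SNV\tGENE1:TX1:exon3:c.C30T:p.S10S,\tchr1\t250\t250\tC\tT\n",
    "line9\tnonsynonymous SNV\tGENE2:TX2:exon1:c.G5A:p.R2H,\tchr2\t40\t40\tG\tA\n",
]
demo_counts = tally_exonic_effects(demo_rows)
```

Calling `tally_file` on a raw_variants.snp.avinput.exonic_variant_function file would give per-effect totals that can be compared with SNP_Annotation_Statistics.xls, provided the column layout assumption holds.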

For structural variants (SVs), BreakDancer (v1.4.4) (Chen et al., 2009) was used with parameter -q 20 to detect indels, inversions, intra-chromosomal translocations and inter-chromosomal translocations. The SVs detected were filtered by removing those with fewer than two supporting reads; indels and inversions were further annotated by ANNOVAR. Characteristics of SVs such as their total numbers, distribution across the genome and length were assessed.
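
The supporting-read filter can be sketched as follows. This assumes the usual BreakDancer tab-separated output, in which header lines start with "#", the seventh column holds the SV type and the tenth column holds the number of supporting reads; the column positions, function name and demo rows are assumptions for illustration and should be checked against the actual files.

```python
from typing import Iterable, Iterator, List

TYPE_COL = 6       # zero-based index of the SV type column (assumed layout)
NUM_READS_COL = 9  # zero-based index of the supporting-read count (assumed layout)


def filter_sv_calls(lines: Iterable[str], min_support: int = 2) -> Iterator[List[str]]:
    """Yield SV records supported by at least `min_support` reads."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[NUM_READS_COL]) >= min_support:
            yield fields


# Made-up demo records in the assumed column order.
demo_sv = [
    "#Chr1\tPos1\tOrientation1\tChr2\tPos2\tOrientation2\tType\tSize\tScore\tnum_Reads\n",
    "chr1\t1000\t5+0-\tchr1\t1500\t0+5-\tDEL\t480\t99\t5\n",
    "chr2\t2000\t1+0-\tchr2\t2300\t0+1-\tINV\t290\t30\t1\n",
    "chr3\t500\t3+0-\tchr7\t900\t0+3-\tCTX\t-200\t60\t2\n",
]
kept_types = [record[TYPE_COL] for record in filter_sv_calls(demo_sv)]
```

In the demo, the inversion supported by a single read is removed, while the deletion and the inter-chromosomal translocation are kept.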

Based on genome read depth, CNVnator (v0.3) (Abyzov et al., 2011) was used to detect copy number variants (CNVs), i.e. potential deletions and duplications, with the following parameter: -call 100. The CNVs detected were further annotated by ANNOVAR and their characteristics determined.

The distribution of all types of variants across the whole genomes was visualized by Circos (Krzywinski et al., 2009).

The genomic data were submitted to the NCBI under the BioProject ID PRJNA1098449. The individual BioSample IDs for each genome are SAMN40907891, SAMN40907892, SAMN40907893 and SAMN40907894.

Funding

Czech Science Foundation, Award: 21-28637L

FWF Austrian Science Fund, Award: I5081-B