Data from: Pan-genome analysis highlights the role of structural variation in the evolution and environmental adaptation of Asian honeybees
Data files
Oct 26, 2023 version files 2.92 GB
- data_release.zip (2.92 GB)
- README.md (8.77 KB)
Abstract
The Asian honeybee, Apis cerana, is an ecologically and economically important pollinator. Mapping its genetic variation is key to understanding population-level health, histories, and potential capacities to respond to environmental changes. However, most efforts to date have focused on single nucleotide polymorphisms (SNPs) based on a single reference genome, thereby ignoring larger-scale genomic variation. We employed long-read sequencing technologies to generate a chromosome-scale reference genome for the ancestral group of A. cerana. Integrating this with 525 resequencing datasets, we constructed the first pan-genome of A. cerana, encompassing almost the entire gene content. We found that 31.32% of genes in the pan-genome were variably present across populations, providing a broad gene pool for environmental adaptation. We identified and characterized structural variations (SVs) and found that they were not closely linked with SNP distributions; however, the formation of SVs was closely associated with transposable elements. Furthermore, phylogenetic analysis using SVs revealed a novel A. cerana ecological group not recoverable from the SNP data. Environmental association analysis identified a total of 44 SVs likely to be associated with environmental adaptation. Verification and analysis of one of these, a 330 bp deletion in the Atpalpha gene, indicated that this SV may promote the cold adaptation of A. cerana by altering gene expression. Taken together, our study demonstrates the feasibility and utility of applying pan-genome approaches to map and explore genetic feature variations of honeybee populations, and in particular to examine the role of SVs in the evolution and environmental adaptation of A. cerana.
README: Materials for the pan-genome study of Apis cerana
In this study, we generated three chromosome-scale reference genome sequences by applying PacBio sequencing and high-throughput chromosome conformation capture (Hi-C) techniques, including one for an ancestral group of A. cerana (named HB to indicate that samples were collected from Hubei Province, China). In addition, using reference-guided assembly methods, we collected and used 525 high-quality resequencing datasets to construct the A. cerana pan-genome.
Through the integrative analysis of multiple genomic datasets including reference genomes and the first A. cerana pan-genome, our study provides a comprehensive catalogue of the repertoire of genetic variants of this widespread Asian honeybee. The results highlight the importance of using more reference genomes and genotyping larger-sized genetic variations to extend investigations beyond SNPs and small Indels and reveal additional contributors to the molecular basis of adaptation. These genomic resources and catalogued variants will be valuable for researchers interested in further developing our understanding of the molecular basis of adaptation in A. cerana and will lay the foundation for future conservation and management of this important pollinator.
Description of the data and file structure
Here is the data employed in the Apis cerana pan-genome analysis, which includes three chromosome-level genomes with genome annotation, non-reference sequences with genome annotation, two scaffold genomes, transposable element sequences detected in each genome, a binary matrix file for gene presence-absence variation (PAV), and genotypic variation data from 525 bee samples, encompassing both single nucleotide polymorphisms (SNPs) and structural variations (SVs).
The dataset contains the following content:
data_release/Sequence-Pan-genome
- data_release/Sequence-Pan-genome/AB
- The assembled genome of the AB population of Apis cerana was annotated using the MAKER2 pipeline.
- data_release/Sequence-Pan-genome/AB/AB_final_assembly.FINAL.fasta
- Genome assembly file (.fasta) for the female Apis cerana genome. The assembled genome contains 23 chromosomes and X unassembled scaffolds.
- Sequence-Pan-genome/AB/AB.mod.EDTA.TElib.fa
- Transposable element sequences annotated by EDTA.
- data_release/Sequence-Pan-genome/AB/AB-all.gff
- The accompanying gene annotation file for AB_final_assembly.FINAL.fasta in .gff format. Annotations were generated by the MAKER2 (v2.31.9) pipeline.
- data_release/Sequence-Pan-genome/AB/AB.all.maker.proteins.fasta
- Gene peptide sequences in .fasta format.
- data_release/Sequence-Pan-genome/AB/AB.all.maker.transcripts.fasta
- Gene CDS sequences in .fasta format.
- data_release/Sequence-Pan-genome/HB-REF
- In this study, we generated a chromosome-scale reference genome sequence for the ancestral group of Apis cerana (named HB to indicate that samples were collected from Hubei Province, China). The assembled HB genome was annotated using the MAKER2 (v2.31.9) pipeline.
- data_release/Sequence-Pan-genome/HB-REF/HB-ref.fasta
- Genome assembly file (.fasta) for the female Apis cerana genome. The assembled genome contains 23 chromosomes and X unassembled scaffolds.
- Sequence-Pan-genome/HB-REF/HB.mod.EDTA.TElib.fa
- Transposable element sequences annotated by EDTA.
- data_release/Sequence-Pan-genome/HB-REF/HB.FINAL.all.gff
- The accompanying gene annotation file for HB-ref.fasta in .gff format. Annotations were generated by the MAKER2 (v2.31.9) pipeline.
- data_release/Sequence-Pan-genome/HB-REF/HB.all.maker.proteins.fasta
- Gene peptide sequences in .fasta format.
- data_release/Sequence-Pan-genome/HB-REF/HB.all.maker.transcripts.fasta
- Gene CDS sequences in .fasta format.
- data_release/Sequence-Pan-genome/HN
- The assembled genome of the HN population of Apis cerana was annotated using the MAKER2 (v2.31.9) pipeline.
- data_release/Sequence-Pan-genome/HN/HN_final_assembly.FINAL.fasta
- Genome assembly file (.fasta) for the female Apis cerana genome. The assembled genome contains 23 chromosomes and X unassembled scaffolds.
- Sequence-Pan-genome/HN/HN.mod.EDTA.TElib.fa
- Transposable element sequences annotated by EDTA.
- data_release/Sequence-Pan-genome/HN/HN-all.gff
- The accompanying gene annotation file for HN_final_assembly.FINAL.fasta in .gff format. Annotations were generated by the MAKER2 (v2.31.9) pipeline.
- data_release/Sequence-Pan-genome/HN/HN.all.maker.proteins.fasta
- Gene peptide sequences in .fasta format.
- data_release/Sequence-Pan-genome/HN/HN.all.maker.transcripts.fasta
- Gene CDS sequences in .fasta format.
- data_release/Sequence-Pan-genome/JL
- The assembled Apis cerana genome from the JL region (Jilin Province, China).
- data_release/Sequence-Pan-genome/JL/jl_clean_assembly.fasta
- Genome assembly file (.fasta) for the female Apis cerana genome. The assembled genome contains 23 chromosomes and X unassembled scaffolds.
- Sequence-Pan-genome/JL/JL.mod.EDTA.TElib.fa
- Transposable element sequences annotated by EDTA.
- data_release/Sequence-Pan-genome/JX
- The assembled Apis cerana genome from the JX region (Jiangxi Province, China).
- data_release/Sequence-Pan-genome/JX/jx_clean_assembly.fasta
- Genome assembly file (.fasta) for the female Apis cerana genome. The assembled genome contains 23 chromosomes and X unassembled scaffolds.
- Sequence-Pan-genome/JX/JX.mod.EDTA.TElib.fa
- Transposable element sequences annotated by EDTA.
- data_release/Sequence-Pan-genome/None-ref
- The WGS reads of each A. cerana sample were trimmed, de-duplicated, and aligned to the reference genome (HB) to obtain the non-reference reads (NRRs). The NRRs were assembled and then mapped back to the reference genome to obtain the non-reference sequences. These sequences were then filtered to remove contaminants and redundant sequences. The clean, non-redundant sequences from each sample were iteratively added to the final pan-genome sequence.
- data_release/Sequence-Pan-genome/None-ref/Non-ref.fasta
- Sequence file (.fasta) containing the Apis cerana non-reference sequences.
- data_release/Sequence-Pan-genome/None-ref/Non-ref.all.gff
- The accompanying gene annotation file for Non-ref.fasta in .gff format. Annotations were generated by the MAKER pipeline.
- data_release/Sequence-Pan-genome/None-ref/Non-ref.all.maker.proteins.fasta
- Gene peptide sequences in .fasta format.
- data_release/Sequence-Pan-genome/HN/HN.all.maker.transcripts.fasta
- Gene CDS sequences in .fasta format.
- Sequence-Pan-genome/Apis_cerana_Pan_gene.xlsx
- Summary of gene annotations for the Apis cerana pan-genome.
- Column headers:
- Chromosome
- The number of the chromosome on which the gene is located
- Gene-ID
- The name of the gene we used in our pan-genome study on Apis cerana
- GFF/Seq-ID
- The original names of the reference genes (HB genome) and non-reference genes as annotated by the MAKER2 pipeline.
- Start
- The starting position of the gene
- End
- The ending position of the gene
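To illustrate how the annotation files relate to the columns listed above, the following sketch extracts gene coordinates from a GFF3 annotation such as HB.FINAL.all.gff. It assumes the standard nine-column, tab-separated GFF3 layout with `gene` features carrying an `ID` attribute, which is how MAKER typically writes its output; the function name `read_gene_coordinates` and the example records are hypothetical and are not taken from the dataset.

```python
import io

def read_gene_coordinates(handle):
    """Yield (seqid, gene_id, start, end) for each 'gene' feature in a GFF3 stream."""
    for line in handle:
        if line.startswith("##FASTA"):
            break  # GFF3 files may embed sequences after this directive
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # keep gene-level features only
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        yield fields[0], attributes.get("ID", ""), int(fields[3]), int(fields[4])

# Hypothetical records, used only to show the expected input shape.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=gene0001\n"
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=gene0001-RA;Parent=gene0001\n"
    "chr2\tmaker\tgene\t50\t400\t.\t-\t.\tID=gene0002\n"
)
genes = list(read_gene_coordinates(example))
for seqid, gene_id, start, end in genes:
    print(seqid, gene_id, start, end)
```

For a file on disk, an open text handle to the .gff file can be passed in place of the in-memory example.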
data_release/Genotyped-file
- Genotyped-file/all.clean.snp.vcf.gz
- Genotypes of SNPs located on the reference genome (HB), identified using the Genome Analysis Toolkit (GATK, version 4.2.1.0) pipeline.
- Genotyped-file/Base_SV.vcf.gz
- Structural variants (SVs) were called using three classic WGS SV callers. A total of 19,955 non-redundant SVs were identified, with each A. cerana sample yielding an average of 1,550 SVs.
- Genotyped-file/Pan-Gene-PAVmatrix.txt
- A binary matrix that integrates the gene PAV information of all samples. To call gene presence-absence variation (PAV), we mapped the raw reads from each A. cerana sample to the pan-genome using BWA-MEM v0.7.12 with default parameters. Gene PAV was detected from the mapped BAM files using SGSGeneLoss v0.1 with the options "minCov=2 lostcut=0.2". If more than 80% of the exon regions of a gene were covered by at least 2 reads, the gene was called as present with the "1/1" genotype. Because regions of low sequencing depth may bias the read evidence, we retained only A. cerana samples with average sequencing coverage over 0.8 and average sequencing depth over 5. After removing low-quality data, we generated a binary PAV matrix representing the presence or absence of each gene in each sample.
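The presence rule and a typical use of the PAV matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the SGSGeneLoss implementation: `call_presence` applies the "more than 80% of exon regions covered by at least 2 reads" rule to a list of per-base exon depths, which is one possible reading of that rule. The layout of Pan-Gene-PAVmatrix.txt is not specified here, so `summarize_pav` assumes a tab-separated table with a header row, one gene per row, the gene ID in the first column, and 0/1 values for each sample. All function names and the toy data are hypothetical.

```python
import csv
import io

def call_presence(exon_depths, min_cov=2, min_fraction=0.8):
    """Return 1 if more than min_fraction of exon bases have depth >= min_cov, else 0."""
    if not exon_depths:
        return 0
    covered = sum(1 for depth in exon_depths if depth >= min_cov)
    return 1 if covered / len(exon_depths) > min_fraction else 0

def summarize_pav(handle):
    """Split genes into those present in every sample and those absent from at least one."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (assumed: gene ID, then sample names)
    core, variable = [], []
    for row in reader:
        if not row:
            continue
        gene_id, calls = row[0], [int(value) for value in row[1:]]
        (core if all(calls) else variable).append(gene_id)
    return core, variable

# Toy matrix with three genes and three samples (hypothetical values).
toy = io.StringIO(
    "gene\tS1\tS2\tS3\n"
    "g1\t1\t1\t1\n"
    "g2\t1\t0\t1\n"
    "g3\t0\t0\t1\n"
)
core, variable = summarize_pav(toy)
variable_fraction = len(variable) / (len(core) + len(variable))
print(core, variable, round(variable_fraction, 4))
```

If the real matrix is oriented differently (samples as rows) or encodes genotypes as strings such as "1/1", the parsing step would need to be adapted before the same classification is applied.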