Assemblies, associated annotation files, and analysis source data of the Platanus x acerifolia genome
Data files
May 02, 2024 version files 7.57 GB
- Platanus_acerifolia_Group1_CDS.fasta
- Platanus_acerifolia_Group1_genome.fasta
- Platanus_acerifolia_Group1_prot.fasta
- Platanus_acerifolia_Group1.gff
- Platanus_Group1_function.xls
- Platanus_hexaploid_CDS.fasta
- Platanus_hexaploid_function.xls
- Platanus_hexaploid_gene.gff3
- Platanus_hexaploid_genome.fasta
- Platanus_hexaploid_prot.fasta
- Platanus_source_data.zip
- README.md
Abstract
Platanus × acerifolia (London plane; Platanaceae) is a major ornamental tree used worldwide. Platanaceae is one of the last early-diverging eudicot families without a complete nuclear genome assembly. Here, we assembled a high-quality, chromosome-level reference genome for P. × acerifolia.
README: Assemblies, associated annotation files, and analysis source data of the Platanus x acerifolia genome
https://doi.org/10.5061/dryad.j6q573nm9
Description of the data and file structure
This dataset contains the genome assemblies, annotation files, and analysis source data of Platanus x acerifolia.
File Name | Description |
---|---|
Platanus_acerifolia_Group1_genome.fasta | Pseudo-haplotype genome sequence file |
Platanus_acerifolia_Group1.gff | Pseudo-haplotype genome annotation file |
Platanus_acerifolia_Group1_prot.fasta | Pseudo-haplotype AA sequence file |
Platanus_acerifolia_Group1_CDS.fasta | Pseudo-haplotype CDS sequence file |
Platanus_acerifolia_Group1_function.xls | Pseudo-haplotype Function annotation |
Platanus_hexaploid_genome.fasta | Hexaploid genome sequence file |
Platanus_hexaploid_gene.gff3 | Hexaploid genome annotation file |
Platanus_hexaploid_prot.fasta | Hexaploid AA sequence file |
Platanus_hexaploid_CDS.fasta | Hexaploid CDS sequence file |
Platanus_hexaploid_function.xls | Hexaploid Function annotation |
Platanus_source_data.zip | Source analysis data of Pseudo-haplotype Platanus x acerifolia |
The *Group1* files contain the datasets from the pseudo-haplotype genome of Platanus x acerifolia.
The *hexaploid* files contain the datasets from the hexaploid genome of Platanus x acerifolia.
The *.fasta files contain the nucleic acid sequences or peptide sequences represented by single letters with sequence names before the sequence in fasta format.
The *.gff3 files contain the structural annotation of genes in gff3 format.
The *function.xls files contain the functional annotation of genes.
The Platanus_source_data.zip file contains source analysis data for Platanus x acerifolia. The README files within this zip file provide descriptions of the data.
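For readers who want a quick sanity check of the files after download, the following is a minimal sketch in Python. It assumes plain-text FASTA input and the standard nine-column, tab-separated GFF3 layout (feature type in column 3); whether the `.gff` file follows exactly the same layout as the `.gff3` file is an assumption. The function names `read_fasta` and `count_gff3_features` are hypothetical helpers, not part of the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_gff3_features(lines: Iterable[str]) -> Dict[str, int]:
    """Count records per feature type (column 3), assuming 9 tab-separated columns."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return dict(counts)


# Toy in-memory example; real use would pass an open file handle instead.
toy_fasta = [">gene1 example", "ATGGCC", "TAA", ">gene2", "ATGTGA"]
for seq_name, seq in read_fasta(toy_fasta):
    print(seq_name, len(seq))
```

Both functions accept any iterable of lines, so a file can be passed directly, for example `read_fasta(open("Platanus_acerifolia_Group1_CDS.fasta"))`. Reading line by line keeps memory use low for the annotation files, although each individual sequence is still held in memory while it is assembled.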
Methods
Genome and transcriptome sequencing
The extracted genomic DNA met the requirements for both Illumina (San Diego, CA, USA) sequencing and PacBio High Fidelity (HiFi; Covaris, Massachusetts, USA) sequencing (Supplementary Note 2.2). For Illumina sequencing of the P. × acerifolia genome, a 350-bp library was constructed and sequenced on the Illumina NovaSeq platform. SMRTbell™ libraries of approximately 10 kb were constructed for single-molecule sequencing with PacBio; the data generated were used to assemble the highly heterogeneous P. × acerifolia polyploid genome. The HiFi reads were sequenced using the PacBio Sequel II system following the standard PacBio™ protocol. To anchor genome scaffolds onto chromosomes, we also constructed a Hi-C library and obtained sequence data via the MGI-SEQ 2000 platform (Wuhan, Hubei, China). In brief, fresh leaves were fixed with a nuclei isolation buffer containing 2% formaldehyde following previously reported methods. The DNA was digested with DpnII. Biotin-14-dCTP was removed from unligated DNA ends by the exonuclease activity of T4 DNA polymerase, and the ligated DNA was then sheared into 300−600-bp fragments. Finally, the raw cross-linked fragments were processed into paired-end sequencing libraries, and the obtained libraries were sequenced on the MGI-SEQ 2000 platform. All the transcriptome libraries were also sequenced on the MGI-SEQ 2000 platform to obtain high-quality paired-end reads for subsequent analyses.
De novo assembly and chromosome anchoring
Before the de novo assembly of the P. × acerifolia genome, the genome size was estimated. After data trimming by fastp, 98 Gb of high-quality paired-end reads were obtained from NGS genome datasets. Then high-quality reads were split into k-mers from 17 bp to 31 bp by Jellyfish, and genome size was estimated using GenomeScope. The HiFi Circular Consensus Sequence (CCS) reads were generated by CCS software (github.com/PacificBiosciences/ccs) with parameters ‘--min-passes 1 --min-rq 0.99 --min-length 100’. Hifiasm was employed to generate contigs and resolve segmental duplications. Afterward, contigs annotated to mitochondrial DNA, plasmids, and metazoa against the NT database by BLASTN were removed.
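The idea behind k-mer-based genome size estimation can be illustrated with a minimal Python sketch. It is a deliberate simplification and not what Jellyfish or GenomeScope do internally: k-mers are counted in memory without canonicalization, and the size is taken as the total number of k-mers divided by the depth of the tallest histogram peak. GenomeScope instead fits a statistical model to the histogram, which matters for a heterozygous polyploid where the tallest peak need not be the homozygous one. The function names and the `min_depth` cutoff are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer in the reads (no canonicalization, for brevity)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts: Counter) -> Dict[int, int]:
    """Map each depth to the number of distinct k-mers observed at that depth."""
    return dict(Counter(counts.values()))


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 2) -> float:
    """Naive estimate: (total k-mers) / (depth of the tallest peak).

    Depths below min_depth are treated as sequencing errors and ignored
    (an illustrative cutoff, not a recommended value).
    """
    kept = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(depth * n for depth, n in kept.items())
    return total_kmers / peak_depth


# Toy histogram: an error peak at depth 1 and a main peak at depth 10.
toy_hist = {1: 50, 9: 2, 10: 100, 11: 3, 20: 5}
print(estimate_genome_size(toy_hist))
```

In practice the histogram would come from a dedicated k-mer counter rather than from `count_kmers`, which is only suitable for toy inputs.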
Benchmarking Universal Single-Copy Orthologs (BUSCO) scores were used to assess the completeness of the assembled genome by comparison against embryophyta_odb10. This assessment can be run without a genome annotation by using genome mode ("-m genome"). We examined the sequence identity by aligning the Illumina NovaSeq-generated paired-end reads to the assembled genome using BWA.
For chromosome anchoring, we first filtered low-quality sequences (quality scores <20), adaptor sequences, and short sequences (<30 bp) from the raw Hi-C data using fastp. The clean paired-end reads were mapped to the assembled contigs by Bowtie2 with the parameters ‘--end-to-end --very-sensitive -L 30’. Invalid read pairs, including dangling-end, self-cycle, re-ligation, and dumped products, were filtered by HiC-Pro. After the valid interaction pairs were generated, the contigs were clustered, ordered, and oriented onto chromosomes by LACHESIS with the parameters ‘CLUSTER_MIN_RE_SITES=100; CLUSTER_MAX_LINK_DENSITY=2.5; CLUSTER_NONINFORMATIVE_RATIO=1.4; ORDER_MIN_N_RES_IN_TRUNK=60; ORDER_MIN_N_RES_IN_SHREDS=60’. Finally, we manually corrected the errors in the placement and orientation of the contigs exhibiting obvious discrete chromatin interaction patterns.
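The categories of invalid Hi-C read pairs named above can be illustrated with a simplified Python sketch. This is not HiC-Pro's code: it assumes each read is reduced to a single mapped position and strand, assigns reads to restriction fragments by that position, and applies the commonly described rules (reads on one fragment facing each other are dangling ends, facing away are self-circles, on the same strand are dumped; reads on adjacent fragments are re-ligations). The `Read` class, the function names, and the category labels are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Read:
    chrom: str
    pos: int      # representative mapped position on the contig
    strand: str   # "+" or "-"


def fragment_index(cut_sites: List[int], pos: int) -> int:
    """Index of the restriction fragment containing pos.

    cut_sites is a sorted list of cut positions on one contig;
    fragment i covers cut_sites[i - 1] <= pos < cut_sites[i].
    """
    return bisect_right(cut_sites, pos)


def classify_pair(r1: Read, r2: Read, cut_sites: Dict[str, List[int]]) -> str:
    """Classify a read pair using simplified fragment and orientation rules."""
    if r1.chrom != r2.chrom:
        return "valid"  # pairs spanning two contigs are kept
    first, second = sorted((r1, r2), key=lambda r: r.pos)
    sites = cut_sites[r1.chrom]
    f1 = fragment_index(sites, first.pos)
    f2 = fragment_index(sites, second.pos)
    if f1 == f2:
        if first.strand == "+" and second.strand == "-":
            return "dangling_end"  # reads face each other on one fragment
        if first.strand == "-" and second.strand == "+":
            return "self_circle"   # reads face away from each other
        return "dumped"            # same strand on one fragment
    if f2 - f1 == 1:
        return "religation"        # neighbouring fragments rejoined
    return "valid"


# Toy example: DpnII-like cut sites at 100, 200, and 300 on one contig.
sites = {"ctg1": [100, 200, 300]}
print(classify_pair(Read("ctg1", 110, "+"), Read("ctg1", 180, "-"), sites))
```

Only pairs labelled `valid` would be passed on to clustering and ordering. Real pipelines apply further filters (for example on insert size and mapping quality) that this sketch omits.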
Genome annotation
We used de novo prediction and similarity mapping prediction strategies to identify repeat elements and gene boundaries. De novo predictions of repeat elements and family clustering were performed using RECON, RepeatScout, and RepeatModeler. For similarity-based repeat element prediction, RepeatMasker was used to identify the repeat elements from libraries of Repbase and Dfam. Long terminal repeats (LTRs) were detected by LTR_retriever and LTR_finder. INFERNAL was used to predict non-coding RNA based on the Rfam database, and tRNAscan-SE was employed to identify tRNAs.
The gene structure of protein-coding genes was annotated by RNA-seq-based prediction, ab initio prediction, and similarity-based prediction. We used Trinity to generate de novo transcript assemblies. In addition, transcriptome reads were aligned to the P. × acerifolia genome by TopHat2 and assembled into transcripts via Cufflinks. CD-HIT was employed to filter the combined transcripts with the parameters ‘-aL -c 0.8’. Afterward, PASA (Program to Assemble Spliced Alignments) was used to align transcripts to the P. × acerifolia genome and predict complete gene structures with high confidence. Based on the resultant high-confidence complete structures, high-quality gene models were predicted using PASA, SNAP, GeneMark, and AUGUSTUS for ab initio annotation. Protein profiles from the four species most closely related to P. × acerifolia according to APG IV were used for similarity-based prediction of gene structures (viz., Nelumbo nucifera, Telopea speciosissima, Macadamia integrifolia, and Platanus occidentalis (PRJNA79937)). Finally, gene structures were predicted using MAKER by integrating the above-mentioned prediction methods. The functions of protein-coding genes were identified by aligning protein sequences against the NR, Gene Ontology, eggNOG, InterPro, UniProt, TrEMBL, and KEGG databases. The completeness of the annotation was then assessed against embryophyta_odb10 using BUSCO.