Data from: Chromosome-scale genome assembly of bread wheat’s wild relative Triticum timopheevii
Data files
Jan 30, 2024 version files 19.45 GB
- README.md (9.01 KB)
- Tim.S95.v2.hifi.5mC.3col.bed (9.60 GB)
- Timopheevii.final.pm.oriented.fasta.gz (2.78 GB)
- Timopheevii.final.pm.oriented.pretext (63.97 MB)
- Timopheevii.final.pm.oriented.with_org.fasta.gz (2.78 GB)
- Tritim_EIv0.3.annotation_repeatsRM.gff3 (3.32 GB)
- Tritim_EIv0.3.annotation.cds.fasta.gz (65.84 MB)
- Tritim_EIv0.3.annotation.final_table.tsv (31.52 MB)
- Tritim_EIv0.3.annotation.gff3 (352.48 MB)
- Tritim_EIv0.3.annotation.pep.fasta.gz (36.77 MB)
- Tritim_EIv0.3.release.functional_annotation.gff3 (362.74 MB)
- Tritim_EIv0.3.release.gff3.pep.fasta.functional_annotation.tsv (49.33 MB)
Abstract
Wheat (Triticum aestivum) is one of the most important food crops with an urgent need for increase in its production to feed the growing world. Wheat’s wild relative species provide a hugely untapped reservoir of genetic diversity for wheat improvement. Triticum timopheevii (2n = 4x = 28) is a tetraploid wheat wild relative species containing the At and G genomes that has been exploited in many wheat pre-breeding programmes over the last few decades. In this study, we report the generation of a chromosome-scale reference genome assembly of T. timopheevii accession PI 94760 based on PacBio HiFi reads and chromosome conformation capture (Hi-C). The total assembly size was 9.35 Gb with a contig N50 of 42.4 Mb. In total, 166,325 gene models were predicted. Comparative genome analysis confirmed previously known chromosomal translocations and indicated new chromosome rearrangements. Analysis of the genomic distribution of DNA methylation showed that the G genome had on average more methylated bases than the At genome. The G genome was also more closely related to the S genome of Aegilops speltoides than to the B genome of hexaploid or tetraploid wheat. In summary, the T. timopheevii genome assembly provides a valuable resource for genome-informed discovery and cloning of agronomically important genes for future food security.
README: Chromosome-scale genome assembly of bread wheat’s wild relative *Triticum timopheevii*
https://doi.org/10.5061/dryad.mpg4f4r6p
Assembly\
Pseudomolecules assembled with Hifiasm and Salsa2 and organelle genome scaffolds assembled with Oatk (https://github.com/c-zhou/oatk)
- Timopheevii.final.pm.oriented.with_org.fasta.gz (assembly with organellar genomes)
- Timopheevii.final.pm.oriented.fasta.gz (assembly of only nuclear chromosomes)
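As a quick sanity check after download, the sequence lengths in either assembly file can be tabulated directly from the gzipped FASTA. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the dataset. Note that it computes an N50 over whatever sequences are in the file (pseudomolecules and scaffolds), which is not the same quantity as the contig N50 quoted in the abstract.

```python
import gzip
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence ID is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Largest length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def lengths_from_gz(path: str) -> Dict[str, int]:
    """Read a gzipped FASTA file in text mode, streaming line by line."""
    with gzip.open(path, "rt") as handle:
        return fasta_lengths(handle)
```

For example, `n50(list(lengths_from_gz("Timopheevii.final.pm.oriented.fasta.gz").values()))` would report the N50 of the sequences in the nuclear-only assembly file.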
Hi-C Contact map generated by mapping short-reads using the Arima pipeline (https://github.com/ArimaGenomics/mapping_pipeline) and scaffolds manually curated with Rapid Curation Pipeline (https://gitlab.com/wtsi-grit/rapid-curation)
- Timopheevii.final.pm.oriented.pretext
- This is the final Hi-C contact map, generated by mapping the Hi-C short reads onto the curated assembly. The file can be opened with PretextView v.0.2.5 (available at https://github.com/wtsi-hpag/PretextView/releases)
Annotation\
Gene models were generated from the Timopheevii.final.pm.oriented.fasta assembly using REAT - Robust and Extendable eukaryotic Annotation Toolkit https://github.com/EI-CoreBioinformatics/reat and Minos https://github.com/EI-CoreBioinformatics/minos which make use of Mikado https://github.com/EI-CoreBioinformatics/mikado, Portcullis https://github.com/EI-CoreBioinformatics/portcullis and many third party tools (listed in the above repositories).
1) Repeat annotation was performed using EI-Repeat (https://github.com/EI-CoreBioinformatics/eirepeat) which uses third party tools (listed in the repository). Repeat Annotation file is: Tritim_EIv0.3.annotation_repeatsRM.gff3
2) A reference set of hexaploid wheat gene models was derived from public gene sets (IWGSC and 10+wheat) projected onto the IWGSC 161010_Chinese_Spring_v1.0_pseudomolecules.fasta assembly; a filtered and consolidated set of models was derived with Minos, with a primary model defined for each gene. Models were scored on a combination of intrinsic gene structure characteristics, evidence support (protein and transcriptome data) and consistency in gene structure across the input gene models. The Minos primary models were classified as full-length or partial based on alignment to a filtered magnoliopsida Swiss-Prot TrEMBL database. This assignment, together with criteria for gene structure characteristics and the original confidence classification, was used to classify models into 6 categories (Platinum, Gold, Silver, Bronze, Stone and Paper). Platinum is the highest confidence category: it contains models that were assessed as full-length, had an original confidence classification of "high", met structural checks for number of UTR and CDS/cDNA ratio, and were assessed as consistently annotated across the input gene sets. Reclassification resulted in 55319 Platinum, 24789 Gold, 11968 Silver, 61845 Bronze, 110518 Stone and 115336 Paper genes. The four highest confidence categories (Platinum, Gold, Silver and Bronze) were projected onto the Timopheevii.final.pm.oriented.fasta assembly with Liftoff; only those models transferred fully with no loss of bases and identical exon/intron structure were retained (ei-liftover pipeline, https://github.com/lucventurini/ei-liftover).
3) High confidence genes annotated in the hexaploid wheat cv. Chinese Spring iwgsc_refseqv2.1_assembly.fa assembly were projected onto the Timopheevii.final.pm.oriented.fasta assembly with Liftoff, and only those models transferred fully with no loss of bases and identical exon/intron structure were retained (ei-liftover pipeline, https://github.com/lucventurini/ei-liftover). Among these, gene models with the attribute "manually_curated" in the original iwgsc_refseqv2.1_assembly.fa assembly were extracted as a set.
4) Gene models were created via the REAT Transcriptome workflow using Illumina RNA-Seq, and PacBio IsoSeq and FLNC reads provided by Surbhi Grewal, Nottingham University (IsoSeq + FLNC only models and Illumina, IsoSeq and FLNC combined models).
5) Proteins from gene models of related species (GCF_000003195.3,GCF_000005505.3,GCF_000263155.2,GCF_001433935.1,GCF_002162155.2,GCF_002211085.1,GCF_002575655.2,GCF_016808335.1,GCF_902167145.1,GCF_904849725.1) were aligned and a set of protein based gene models derived using the REAT Homology workflow.
6) Evidence-guided gene models based on transcriptome and protein alignments were generated via augustus (3 alternative configurations and weightings of evidence) and EVidenceModeler via the REAT Prediction workflow.
7) Gene models from (2), (3), (4), (5) and (6) were used in Minos, and a final set of models was selected based on evidence support, intrinsic features of the models and a base score relating to the source of the gene models (see 'config' directory).
For all gene models, a confidence and biotype classification was determined based on the available evidence support. See config/minos_run.run_config.yaml for details. Predicted genes have low homology support and coding potential, and may include pseudogenes, fragments, lncRNAs with small coding regions and mis-annotated CDS features. Transposable element gene classification is based simply on overlap with the identified repeats (>40% repeat overlap). All ncRNAs were classed as low confidence.
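To illustrate the >40% repeat overlap rule, here is a minimal sketch of how such a fraction could be computed for one gene. It is an assumption-laden illustration, not the Minos implementation: the README does not state whether the overlap is measured over the gene span, the exons or the CDS, so this sketch uses the gene span, assumes all intervals lie on the same sequence, and uses 1-based inclusive coordinates as in GFF3. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive, as in GFF3


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_overlap_fraction(gene: Interval, repeats: List[Interval]) -> float:
    """Fraction of the gene span covered by at least one repeat interval."""
    g_start, g_end = gene
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        lo = max(g_start, r_start)
        hi = min(g_end, r_end)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / (g_end - g_start + 1)


def is_te_gene(gene: Interval, repeats: List[Interval], threshold: float = 0.40) -> bool:
    """Apply the 'more than 40% repeat overlap' rule (strictly greater than)."""
    return repeat_overlap_fraction(gene, repeats) > threshold
```

Merging the repeat intervals first matters because repeat annotations commonly overlap one another; without it, the covered fraction could exceed 1.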
Structural Annotation Files
- Tritim_EIv0.3.annotation.gff3
- Tritim_EIv0.3.annotation.cds.fasta.gz
- Tritim_EIv0.3.annotation.pep.fasta.gz
- Tritim_EIv0.3.annotation.final_table.tsv
Repeat Annotation Files
- Tritim_EIv0.3.annotation_repeatsRM.gff3
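The GFF3 files can be read with any standard parser. As a dependency-free alternative, the sketch below streams a GFF3 file line by line and splits the nine tab-separated columns defined by the GFF3 format. The attribute keys actually present in these files are not documented in this README, so the sketch makes no assumption about them; the record IDs in the demo are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator
from urllib.parse import unquote


def parse_gff3(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per feature line; comments, directives and malformed lines are skipped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attributes = {}
        for field in cols[8].split(";"):
            if "=" in field:
                key, value = field.split("=", 1)
                # GFF3 percent-encodes reserved characters in attribute values.
                attributes[key.strip()] = unquote(value.strip())
        yield {
            "seqid": cols[0],
            "source": cols[1],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attributes,
        }


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count how many features of each type (gene, mRNA, exon, ...) occur."""
    return Counter(record["type"] for record in parse_gff3(lines))


# Invented demo lines; real IDs in the dataset will differ.
demo = [
    "##gff-version 3\n",
    "chr1A\tdemo\tgene\t100\t500\t.\t+\t.\tID=gene1;Name=demo%20gene\n",
    "chr1A\tdemo\tmRNA\t100\t500\t.\t+\t.\tID=gene1.1;Parent=gene1\n",
    "chr1A\tdemo\texon\t100\t200\t.\t+\t.\tParent=gene1.1\n",
    "chr1A\tdemo\texon\t300\t500\t.\t+\t.\tParent=gene1.1\n",
]
feature_counts = count_feature_types(demo)
```

Because the repeat annotation file is several gigabytes, passing an open file handle (for example `count_feature_types(open(path))`) keeps memory use low, since lines are processed one at a time.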
# Functional annotation:
All proteins were annotated using AHRD v.3.3.3 (Hallab et al., 2014; https://github.com/groupschoof/AHRD/blob/master/README.textile). Sequences were searched against the reference proteins (Arabidopsis thaliana TAIR10, TAIR10_pep_20101214_updated.fasta.gz - https://www.araport.org) and the UniProt viridiplantae sequences (data download date 06 May 2023), both Swiss-Prot and TrEMBL datasets (The UniProt Consortium, 2014). Searches were run with BLAST (v2.6.0; blastp) with an e-value of 1e-5. InterProScan (v5.22.61; Jones et al., 2014) results were also provided to AHRD. We adapted the standard AHRD example configuration file test/resources/ahrd_example_input_go_prediction.yml, distributed with the AHRD tool. Apart from the locations of the input and output files, the following changes were made:
1. we included the GOA mapping from uniprot (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz) as parameter 'gene_ontology_result',
2. we also included the interpro database (ftp://ftp.ebi.ac.uk/pub/databases/interpro/61.0/interpro.xml.gz) and provided as parameter 'interpro_database',
3. we changed the parameter 'prefer_reference_with_go_annos' to 'false'
4. The blast database specific weights used were:
    blast_dbs:
      swissprot:
        weight: 100
        description_score_bit_score_weight: 0.2
      trembl:
        weight: 50
        description_score_bit_score_weight: 0.4
      tair:
        weight: 50
        description_score_bit_score_weight: 0.4
The column headers of the functional annotation tsv file (Tritim_EIv0.3.release.gff3.pep.fasta.functional_annotation.tsv) are described below:
1 #Transcipt - EI transcript ID
2 #Gene - EI gene ID
3 #Confidence - gene confidence class: High - High Confidence, Low - Low Confidence
4 #Biotype - gene biotype classification
5 #AHRD-Blast-Hit-Accession - The Accession of the Protein the assigned description was taken from, by AHRD
6 #AHRD-Quality-Code - From AHRD documentation:
AHRD’s quality-code consists of a three character string, where each character is either ‘*’ if the respective criteria is met or ‘-’ otherwise. The meaning of each position is as follows:
    - Position 1: Bit score of the blast result is >50 and e-value is <e-10
    - Position 2: Overlap of the blast result is >60%
    - Position 3: Top token score of assigned HRD is >0.5
7 #Human-Readable-Description - The assigned Human Readable Descriptions by AHRD
8 #Interpro-ID (Description) - The Interproscan ID with description derived by AHRD
9 #Gene-Ontology-Term - The Gene Ontology terms derived by AHRD for the transcript
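The nine columns above can be read positionally, and the three-character quality code decoded into one flag per criterion. The sketch below is an illustration only: the dictionary keys are hypothetical names chosen here, and it assumes that the header line of the tsv file starts with `#` (so that it can be skipped as a comment), which should be checked against the actual file.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Hypothetical key names for the nine columns, in the order listed in this README.
FIELDS = (
    "transcript", "gene", "confidence", "biotype", "blast_hit_accession",
    "quality_code", "description", "interpro", "go_terms",
)

# One label per quality-code position, following the criteria quoted from AHRD.
QUALITY_CRITERIA = (
    "bitscore_gt_50_and_evalue_lt_1e-10",
    "overlap_gt_60pct",
    "top_token_score_gt_0.5",
)


def decode_quality_code(code: str) -> Dict[str, bool]:
    """Map a code such as '*-*' to {criterion: met?}; reject anything else."""
    if len(code) != 3 or any(ch not in "*-" for ch in code):
        raise ValueError(f"unexpected quality code: {code!r}")
    return {name: ch == "*" for name, ch in zip(QUALITY_CRITERIA, code)}


def read_functional_annotation(handle: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, skipping lines that start with '#'."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(FIELDS, row))


# Invented demo content; real identifiers in the dataset will differ.
demo_tsv = io.StringIO(
    "#Transcipt\t#Gene\t#Confidence\t#Biotype\t#AHRD-Blast-Hit-Accession\t"
    "#AHRD-Quality-Code\t#Human-Readable-Description\t#Interpro-ID (Description)\t"
    "#Gene-Ontology-Term\n"
    "T1.1\tT1\tHigh\tprotein_coding\tACC0001\t**-\tDemo protein\tIPR000001 (Demo)\tGO:0000001\n"
)
rows = list(read_functional_annotation(demo_tsv))
flags = decode_quality_code(rows[0]["quality_code"])
```

`decode_quality_code` is deliberately strict; if the file contains empty or longer codes for some transcripts, those rows would need to be handled before decoding.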
Functional Annotation Files
- Tritim_EIv0.3.release.functional_annotation.gff3
- Tritim_EIv0.3.release.gff3.pep.fasta.functional_annotation.tsv
Methylation Profile obtained using ccsmeth (https://github.com/PengNi/ccsmeth)
- Tim.S95.v2.hifi.5mC.3col.bed