Chromosome-level assembly of two pearl millet (Cenchrus americanus) genomes, functional annotation and transcriptomes

Kuijer, Hendrik N. J.¹; Wang, Jian You¹; Bougouffa, Salim¹; Abrouk, Michael¹; Jamil, Muhammad¹; Incitti, Roberto¹; Alam, Intikhab¹; Al-Babili, Salim¹

Published Feb 12, 2024 on Dryad. https://doi.org/10.5061/dryad.nk98sf80k

Abstract

We present platinum-grade reference genome assemblies, featuring gapless chromosomes, for two pearl millet lines: the Striga-susceptible SOSAT-C88-P10 (P10) and the resistant 29Aw (Aw). This study was motivated by the severe impact of the root parasitic weed Striga hermonthica on pearl millet yield; Striga relies on host-released strigolactones (SLs) for seed germination. These resources enable advanced genomic investigations in pearl millet, facilitating research on phenotypic traits, diversity, and adaptation. Additionally, comparative analysis of the P10 and 29Aw genomes can pave the way for identifying genetic determinants of Striga resistance, contributing to the development of resilient pearl millet lines.


Description of the data and file structure

Genome assemblies & data

The final assemblies are Awk_genome_assmb_FINAL_v1.fasta.gz (AwK) and P10K_genome_assmb_FINAL_v1.fasta.gz (P10K). These assemblies carry the finalised chromosome names and were used for gene structure annotation with the MAKER pipeline.
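As a quick sanity check after download, the sequence names and lengths in a final assembly can be listed without any third-party library. The following is a minimal sketch that assumes only that the file is a gzip-compressed FASTA; the function name fasta_lengths is illustrative and not part of the dataset.

```python
import gzip


def fasta_lengths(path):
    """Return {sequence_name: length} for a gzip-compressed FASTA file."""
    lengths = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the name.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


# Example usage (file assumed to be in the working directory):
# for name, length in fasta_lengths("P10K_genome_assmb_FINAL_v1.fasta.gz").items():
#     print(f"{name}\t{length}")
```

The file is read line by line, so memory use stays low even for chromosome-scale sequences.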

Other assembly-related files:

Awk_assmb_files.tar.gz and P10k_assmb_files.tar.gz are tarballs containing the output of the juicer tool, which we used with the OmniC data to scaffold the initial HiFi-based assemblies produced by the hifiasm assembler. Each tarball contains an AGP file, an equivalent BED file, a breakpoint report, and the FASTA file with the original chromosome names (do not use this FASTA file with the gene structure annotation).
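To inspect how contigs were placed into scaffolds, the AGP file can be summarised per scaffold. The sketch below assumes the file follows the standard nine-column, tab-separated AGP layout, in which component types N and U denote gaps; the names AgpRow, parse_agp and summarize are illustrative, and the AGP file name inside the tarball is not stated here, so the path in the usage comment is hypothetical.

```python
from collections import namedtuple

# The first five AGP columns have a fixed meaning; the remaining four depend on
# whether the row is a sequence component or a gap, so they are kept as a list.
AgpRow = namedtuple("AgpRow", "object beg end part_number component_type fields")


def parse_agp(lines):
    """Parse AGP lines into AgpRow tuples, skipping comments and blank lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        rows.append(
            AgpRow(cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4], cols[5:9])
        )
    return rows


def summarize(rows):
    """Per object: count of sequence components, count of gaps, and total length."""
    summary = {}
    for row in rows:
        entry = summary.setdefault(row.object, {"components": 0, "gaps": 0, "length": 0})
        if row.component_type in ("N", "U"):
            entry["gaps"] += 1
        else:
            entry["components"] += 1
        entry["length"] = max(entry["length"], row.end)
    return summary


# Example usage (hypothetical path to the AGP file extracted from the tarball):
# with open("path/to/extracted.agp") as handle:
#     print(summarize(parse_agp(handle)))
```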

Genomic Raw Data:

AwK:

PacBio HiFi (Sequel II):
1. AWK_2.r64068_20220308_035507.2_D01.ccs.fastq.gz
2. AWK_2.r64068_20220227_080231.4_D01.ccs.fastq.gz
3. AWK_2.r64068_20220227_080231.3_C01.ccs.fastq.gz
OmniC (Illumina):
1. AWK_DTG-OmniC-419_R1_001.fastq.gz
2. AWK_DTG-OmniC-419_R2_001.fastq.gz

P10K:

PacBio HiFi (Sequel II):
1. P10_1A.r64068_20220227_080231.6_F01.ccs.fastq.gz
2. P10_1A.r64068_20220227_080231.5_E01.ccs.fastq.gz
3. P10_1A.r64068_20220220_071613.3_C01.ccs.fastq.gz
OmniC (Illumina):
1. P10K_DTG-OmniC-420_R1_001.fastq.gz
2. P10K_DTG-OmniC-420_R2_001.fastq.gz

Transcriptome

AWK:

Iso-Seq:

Raw Data: AWK.r64068_20220831_124330_2_B01.ccs.bam

Analysis: the awk_isoseq.tar.gz tarball contains:

barcodes.awk.tsv: barcode information
awk.isoseq.collapse.[gff|fasta]: final collapsed full-length Iso-Seq transcripts

RNA-Seq:

Raw Data: The raw data can be downloaded from ENA under study PRJEB71762. The attached files AWK_rawdata_info.txt and P10K_rawdata_info.txt provide some information about the RNA-Seq experiments.
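One way to locate the FASTQ files for the study programmatically is ENA's file report. The sketch below is written under the assumption that the ENA Portal API filereport endpoint and the field names run_accession and fastq_ftp are available as shown; please verify them against the current ENA documentation. The helper names filereport_url and fastq_links are illustrative.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API file report.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(study, fields=("run_accession", "fastq_ftp")):
    """Build a file report URL that requests a TSV listing of runs for a study."""
    query = urlencode(
        {
            "accession": study,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "tsv",
        }
    )
    return f"{ENA_FILEREPORT}?{query}"


def fastq_links(tsv_text):
    """Map each run accession to its FASTQ locations (semicolon-separated in the TSV)."""
    links = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        locations = [item for item in (row.get("fastq_ftp") or "").split(";") if item]
        links[row["run_accession"]] = locations
    return links


# Example usage (requires network access):
# from urllib.request import urlopen
# with urlopen(filereport_url("PRJEB71762")) as response:
#     print(fastq_links(response.read().decode()))
```

The locations are returned exactly as listed in the report, so a scheme prefix may need to be added before downloading.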

Assembled transcripts:

AWK_assmb_transcripts.tar.gz: this tarball contains the individual transcripts for each of the 16 samples as well as the merged transcripts file, which collapses the transcripts from all samples using StringTie's merge mode.

P10K:

Iso-Seq:

Raw Data: P10K.r64068_20220831_124330_1_A01.ccs.bam

Analysis: the p10k_isoseq.tar.gz tarball contains:

barcode.p10k.tsv: barcode information
p10k.collapsed.[gff|fasta]: final collapsed full-length Iso-Seq transcripts

RNA-Seq:

Raw Data: The raw data can be downloaded from ENA under study PRJEB71762. The attached files AWK_rawdata_info.txt and P10K_rawdata_info.txt provide some information about the RNA-Seq experiments.

Assembled transcripts:

P10K_assmb_transcripts.tar.gz: this tarball contains the individual transcripts for each of the 16 samples as well as the merged transcripts file, which collapses the transcripts from all samples using StringTie's merge mode.

Genome Annotation

The genome annotation files (BED, predicted transcripts and proteins, repeat content, and evidence) are provided in AWK_MAKER_Annotation.tar.gz and P10K_MAKER_Annotation.tar.gz. The main items in each tarball are:

EV: mapped evidence from assembled RNA-Seq and Iso-Seq data as well as homology evidence. Refer to https://github.com/mjfi2sb3/millet-genome-annotation for details.
HC: high-confidence genes. For details on how a gene is assigned as HC, refer to the annotation GitHub repository.
LC: low-confidence genes.
proteins.fasta: protein sequences of the predicted mRNAs.
transcripts.fasta: DNA sequences of the predicted mRNAs.
XXX.MAKER.gff.gz: main MAKER GFF file, which contains everything, including HC, LC and EV.
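Because the main GFF file combines gene models and evidence, a tally of feature types per source gives a quick overview of its contents. The sketch below assumes a gzip-compressed file in the standard nine-column GFF3 layout and stops at a ##FASTA directive if one is present; the function name count_features is illustrative, and XXX stands for the prefix used in the tarball.

```python
import gzip
from collections import Counter


def count_features(path):
    """Count GFF3 records per (source, type) pair in a gzip-compressed file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                # Any embedded sequence section follows this directive; stop here.
                break
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9:
                # Skip lines that do not match the nine-column layout.
                continue
            counts[(cols[1], cols[2])] += 1
    return counts


# Example usage (replace XXX with the actual file prefix):
# for (source, feature_type), n in sorted(count_features("XXX.MAKER.gff.gz").items()):
#     print(f"{source}\t{feature_type}\t{n}")
```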

Sharing/Access information

Another copy of the raw data (including RNA-Seq) and assemblies is available on ENA under study PRJEB71762.
Details about the annotation pipeline can be found in the manuscript or in our GitHub repository at https://github.com/mjfi2sb3/millet-genome-annotation
