Chromosome-level assembly of two pearl millet (Cenchrus americanus) genomes, functional annotation and transcriptomes
Data files
Feb 12, 2024 version files 257.55 GB
- AWK_2.r64068_20220227_080231.3_C01.ccs.fastq.gz (28.23 GB)
- AWK_2.r64068_20220227_080231.4_D01.ccs.fastq.gz (26.79 GB)
- AWK_2.r64068_20220308_035507.2_D01.ccs.fastq.gz (26.37 GB)
- Awk_assmb_files.tar.gz (551.50 MB)
- AWK_assmb_transcripts.tar.gz (1.41 GB)
- AWK_DTG-OmniC-419_R1_001.fastq.gz (18.52 GB)
- AWK_DTG-OmniC-419_R2_001.fastq.gz (20.30 GB)
- Awk_genome_assmb_FINAL_v1.fasta.gz (521.70 MB)
- awk_isoseq.tar.gz (146.02 MB)
- AWK_MAKER_Annotation.tar.gz (300.33 MB)
- AWK_rawdata_info.txt (2.36 KB)
- AWK.r64068_20220831_124330_2_B01.ccs.bam (4.98 GB)
- P10_1A.r64068_20220220_071613.3_C01.ccs.fastq.gz (26.54 GB)
- P10_1A.r64068_20220227_080231.5_E01.ccs.fastq.gz (26.16 GB)
- P10_1A.r64068_20220227_080231.6_F01.ccs.fastq.gz (22.77 GB)
- P10k_assmb_files.tar.gz (554.93 MB)
- P10K_assmb_transcripts.tar.gz (1.41 GB)
- P10K_DTG-OmniC-420_R1_001.fastq.gz (22.43 GB)
- P10K_DTG-OmniC-420_R2_001.fastq.gz (23.30 GB)
- P10K_genome_assmb_FINAL_v1.fasta.gz (524.96 MB)
- p10k_isoseq.tar.gz (154.71 MB)
- P10K_MAKER_Annotation.tar.gz (292.51 MB)
- P10K_rawdata_info.txt (2.39 KB)
- P10K.r64068_20220831_124330_1_A01.ccs.bam (5.29 GB)
- README.md (46.44 KB)
Abstract
We present platinum-grade reference genome assemblies, featuring gapless chromosomes, for two pearl millet lines: the Striga-susceptible SOSAT-C88-P10 (P10) and the resistant 29Aw (Aw). This study was motivated by the severe impact of the root parasitic weed Striga hermonthica on pearl millet yield; Striga relies on host-released strigolactones (SLs) for seed germination. These resources enable advanced genomic investigations in pearl millet, facilitating research on phenotypic traits, diversity, and adaptation. Additionally, comparative analysis of the P10 and 29Aw genomes can pave the way for identifying genetic determinants of Striga resistance, contributing to the development of resilient pearl millet lines.
https://doi.org/10.5061/dryad.nk98sf80k
Description of the data and file structure
Genome assemblies & data
The final assemblies are Awk_genome_assmb_FINAL_v1.fasta.gz and P10K_genome_assmb_FINAL_v1.fasta.gz for AwK and P10K, respectively. These are the assemblies (with finalised chromosome names) that were used for gene structure annotation with the MAKER pipeline.
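As a quick sanity check after download, the sequence names and lengths in a final assembly can be listed with a few lines of Python. This is a minimal sketch using only the standard library; the function names `read_fasta` and `fasta_lengths` are illustrative and not part of the dataset.

```python
import gzip
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (sequence_name, length) for each record in a FASTA stream."""
    name = None
    length = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the identifier, dropping any free-text description.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def fasta_lengths(path: str) -> Dict[str, int]:
    """Return {sequence_name: length} for a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return dict(read_fasta(handle))
```

For example, `fasta_lengths("P10K_genome_assmb_FINAL_v1.fasta.gz")` streams the file line by line, so memory use stays small even for a whole-genome FASTA.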
Other assembly related files:
Awk_assmb_files.tar.gz and P10k_assmb_files.tar.gz are tarballs containing the output of the juicer tool, which we used to scaffold the initial HiFi-based hifiasm assemblies with the OmniC data. Each tarball contains an AGP file, an equivalent BED file, a breakpoint report, and the FASTA file with the original chromosome names (do not use this FASTA with the gene structure annotation).
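The AGP file records how each scaffold was built from contigs, so it offers one way to check how many gaps each scaffold contains. The sketch below assumes the file follows the standard tab-separated, nine-column AGP layout, in which component type `N` or `U` in column 5 marks a gap. The function name `count_gaps` is illustrative, and the name of the AGP file inside the tarball is not stated here, so any path passed to it is the reader's to fill in.

```python
from typing import Dict, Iterable

# In the AGP specification, component types N and U denote gaps
# (known and unknown size, respectively).
GAP_TYPES = {"N", "U"}


def count_gaps(agp_lines: Iterable[str]) -> Dict[str, int]:
    """Count gap lines per object (scaffold/chromosome) in AGP text."""
    gaps: Dict[str, int] = {}
    for line in agp_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed AGP record
        obj, component_type = fields[0], fields[4]
        is_gap = 1 if component_type in GAP_TYPES else 0
        # Objects without gaps are still reported, with a count of 0.
        gaps[obj] = gaps.get(obj, 0) + is_gap
    return gaps
```

Typical use would be `with open(agp_path) as fh: print(count_gaps(fh))`, where `agp_path` points at the AGP file extracted from the tarball.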
Genomic Raw Data:
AwK:
- PacBio HiFi (SequelII):
  - AWK_2.r64068_20220308_035507.2_D01.ccs.fastq.gz
  - AWK_2.r64068_20220227_080231.4_D01.ccs.fastq.gz
  - AWK_2.r64068_20220227_080231.3_C01.ccs.fastq.gz
- OmniC (Illumina):
  - AWK_DTG-OmniC-419_R1_001.fastq.gz
  - AWK_DTG-OmniC-419_R2_001.fastq.gz
P10K:
- PacBio HiFi (SequelII):
  - P10_1A.r64068_20220227_080231.6_F01.ccs.fastq.gz
  - P10_1A.r64068_20220227_080231.5_E01.ccs.fastq.gz
  - P10_1A.r64068_20220220_071613.3_C01.ccs.fastq.gz
- OmniC (Illumina):
  - P10K_DTG-OmniC-420_R1_001.fastq.gz
  - P10K_DTG-OmniC-420_R2_001.fastq.gz
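Because each raw read file is tens of gigabytes, it can be useful to confirm read counts and total yield without loading a file into memory. The following is a minimal streaming sketch; it assumes the usual four-line FASTQ record layout (header, sequence, `+`, qualities), and the function names are illustrative.

```python
import gzip
from typing import TextIO, Tuple


def fastq_stats(handle: TextIO) -> Tuple[int, int]:
    """Return (number_of_reads, total_bases) for a four-line-per-record FASTQ stream."""
    reads = 0
    bases = 0
    for index, line in enumerate(handle):
        # The sequence is the second line of every four-line record.
        if index % 4 == 1:
            reads += 1
            bases += len(line.rstrip("\r\n"))
    return reads, bases


def fastq_gz_stats(path: str) -> Tuple[int, int]:
    """Same as fastq_stats, but for a gzip-compressed FASTQ file on disk."""
    with gzip.open(path, "rt") as handle:
        return fastq_stats(handle)
```

Calling `fastq_gz_stats("AWK_DTG-OmniC-419_R1_001.fastq.gz")` on the R1 and R2 files of a pair should give the same read count; a mismatch would suggest a truncated download.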
Transcriptome
AWK:
Isoseq:
Raw Data: AWK.r64068_20220831_124330_2_B01.ccs.bam
Analysis: the awk_isoseq.tar.gz tarball contains:
- barcodes.awk.tsv: barcoding information
- awk.isoseq.collapse.[gff fasta]: final collapsed full-length Iso-Seq transcripts
RNA-Seq:
Raw Data: the raw data can be downloaded from ENA under study PRJEB71762. The attached AWK_rawdata_info.txt and P10K_rawdata_info.txt files give some information about the RNA-Seq experiments.
Assembled transcripts:
AWK_assmb_transcripts.tar.gz: this tarball contains the individual transcripts for each of the 16 samples, as well as a merged transcripts file that collapses transcripts from all samples using StringTie merge mode.
P10K:
Isoseq:
Raw Data: P10K.r64068_20220831_124330_1_A01.ccs.bam
Analysis: the p10k_isoseq.tar.gz tarball contains:
- barcode.p10k.tsv: barcoding information
- p10k.collapsed.[gff fasta]: final collapsed full-length Iso-Seq transcripts
RNA-Seq:
Raw Data: the raw data can be downloaded from ENA under study PRJEB71762. The attached AWK_rawdata_info.txt and P10K_rawdata_info.txt files give some information about the RNA-Seq experiments.
Assembled transcripts:
P10K_assmb_transcripts.tar.gz: this tarball contains the individual transcripts for each of the 16 samples, as well as a merged transcripts file that collapses transcripts from all samples using StringTie merge mode.
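The exact file names inside the transcript and Iso-Seq tarballs are only partly listed above, so it may help to inspect an archive before extracting it. A minimal sketch with Python's standard `tarfile` module follows; `list_tarball` is an illustrative helper name.

```python
import tarfile
from typing import List, Tuple


def list_tarball(path: str) -> List[Tuple[str, int]]:
    """Return (member_name, size_in_bytes) for every regular file in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        # Directories and links are skipped; only regular files are reported.
        return [(member.name, member.size)
                for member in archive.getmembers()
                if member.isfile()]
```

For instance, `for name, size in list_tarball("AWK_assmb_transcripts.tar.gz"): print(name, size)` prints the per-sample and merged transcript files without unpacking anything to disk.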
Genome Annotation
The genome annotation files (BED, predicted transcripts and proteins, repeat content, and evidence) are provided in AWK_MAKER_Annotation.tar.gz and P10K_MAKER_Annotation.tar.gz. The names used inside each tarball are:
- EV: mapped evidence from the assembled RNA-Seq and Iso-Seq data, as well as homology evidence. Refer to https://github.com/mjfi2sb3/millet-genome-annotation for details.
- HC: high-confidence genes. For details on how a gene is assigned as HC, refer to the annotation GitHub repo.
- LC: low-confidence genes.
- proteins.fasta: protein sequences of the predicted mRNAs.
- transcripts.fasta: DNA sequences of the predicted mRNAs.
- XXX.MAKER.gff.gz: main MAKER GFF file, which contains everything, including HC, LC and EV.
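A simple first look at the main MAKER GFF file is to count features by type (column 3). The sketch below assumes standard nine-column, tab-separated GFF3 and stops at a `##FASTA` directive if one is present. It does not attempt to separate HC, LC and EV records, since how those are encoded in the file is documented in the annotation repository rather than here. The function names are illustrative.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count GFF3 records by feature type (third column)."""
    counts: Counter = Counter()
    for line in lines:
        # Anything after a ##FASTA directive is sequence, not annotation.
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


def count_feature_types_gz(path: str) -> Counter:
    """Same as count_feature_types, for a gzip-compressed GFF file."""
    with gzip.open(path, "rt") as handle:
        return count_feature_types(handle)
```

Running `count_feature_types_gz` on the extracted `XXX.MAKER.gff.gz` file and calling `.most_common()` on the result gives a quick overview of what the file holds.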
Sharing/Access information
Another copy of the raw data (including RNA-Seq) and the assemblies is available on ENA under study PRJEB71762.
Details about the annotation pipeline, including a detailed workflow of the genome annotation, can be found in the manuscript or in our GitHub repo at https://github.com/mjfi2sb3/millet-genome-annotation
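For readers who prefer to fetch the ENA copy programmatically, the run-level file list for the study can be requested from the ENA Portal API file report endpoint. The sketch below only builds the request URL and parses a returned TSV table; the chosen fields (`run_accession`, `fastq_ftp`, `fastq_md5`) and the helper names are assumptions made for illustration, not something prescribed by this dataset.

```python
import csv
import io
from typing import Dict, List, Sequence
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession: str,
                   fields: Sequence[str] = ("run_accession", "fastq_ftp", "fastq_md5")) -> str:
    """Build a file report URL listing the read runs under a study accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text: str) -> Dict[str, List[str]]:
    """Map each run accession to its list of FASTQ locations.

    Multiple files for one run are separated by semicolons in the fastq_ftp column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["run_accession"]: [url for url in row["fastq_ftp"].split(";") if url]
        for row in reader
    }
```

The table itself could then be retrieved with, for example, `urllib.request.urlopen(filereport_url("PRJEB71762")).read().decode()` and passed to `parse_filereport`.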