Data for: Highly contiguous genome assembly of Drosophila prolongata – a model for evolution of sexual dimorphism and male-specific innovations
Data files
Mar 04, 2024 version files 10.72 GB
- blast.tar.gz
- busco.tar.gz
- Dovetail.tar.gz
- Final_Versions.tar.gz
- gff_combine.tar.gz
- mummer.tar.gz
- README.md
Abstract
Drosophila prolongata is a member of the melanogaster species group and rhopaloa subgroup native to the subtropical highlands of southeast Asia. This species exhibits an array of recently evolved male-specific morphological, physiological, and behavioral traits that distinguish it from its closest relatives, making it an attractive model for studying the evolution of sexual dimorphism and testing theories of sexual selection. The lack of genomic resources has impeded the dissection of the molecular basis of sex-specific development and behavior in this species. To address this, we assembled the genome of D. prolongata using long-read sequencing and Hi-C scaffolding, resulting in a highly complete and contiguous (scaffold N50 2.2Mb) genome assembly of 220Mb. The repetitive content of the genome is 24.6%, the plurality of which are LTR retrotransposons (33.2%). Annotations based on RNA-seq data and homology to related species revealed a total of 19,330 genes, of which 16,170 are protein-coding. The assembly includes 98.5% of Diptera BUSCO genes, including 93.8% present as a single copy. Despite some likely regional duplications, the completeness of this genome suggests that it can be readily used for gene expression, GWAS, and other genomic analyses.
README: Data for: Highly Contiguous Genome Assembly of Drosophila prolongata - a Model for Evolution of Sexual Dimorphism and Male-specific Innovations
https://doi.org/10.5061/dryad.mpg4f4r6w
Genome annotations for D. prolongata and D. carrolli, described in the D. prolongata genome report (Luecke et al. 2024).
FASTA files with scaffolds identified as duplicate sequence and removed from the intermediate D. prolongata assembly.
All versions of the D. prolongata genome assembly: the final version described in the paper and submitted to NCBI, the Dovetail HiRise assembly including duplicate scaffolds, and the Dovetail Falcon assembly before HiC scaffolding.
Supporting data provided by Dovetail Genomics alongside the completed HiRise assembly.
Full results and intermediate files for BLAST, BUSCO, and mummer analyses.
Description of the data and file structure
Final Versions of Assembly and Annotation
The Final_Versions/ directory has all Final Version sequences and gene annotations.
The Final_Versions/prolongata/ subdirectory has all files related to D. prolongata: the deduplicated assembly sequences prolongataSaPa_WGS-DeDup.fa and corresponding annotation prolongataSaPa_WGS-DeDup.gff; for scaffolds removed as duplicate (i.e. alternate haplotigs), the sequences prolongataSaPa_WGS-RemovedDups.fa and corresponding annotation prolongataSaPa_WGS-RemovedDups.gff.
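As an illustration of working with the assembly FASTA files, the following is a minimal sketch that tabulates scaffold lengths and computes an N50 value. The function names `read_fasta_lengths` and `n50` are hypothetical helpers written for this example; they are not part of the annotation_tools repository, and the path in the usage comment assumes the archive has been extracted in the working directory.

```python
import gzip
from typing import Dict, Iterable


def read_fasta_lengths(path: str) -> Dict[str, int]:
    """Return {sequence_id: length} for a FASTA file (plain text or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths: Dict[str, int] = {}
    current = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the sequence ID.
                tokens = line[1:].split()
                current = tokens[0] if tokens else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Hypothetical usage, assuming Final_Versions.tar.gz has been extracted:
# lengths = read_fasta_lengths("Final_Versions/prolongata/prolongataSaPa_WGS-DeDup.fa")
# print(len(lengths), sum(lengths.values()), n50(lengths.values()))
```

The same helper accepts the gzipped FASTA files in the Dovetail/ directory, since it chooses the opener from the file extension.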
The Final_Versions/carrolli/ subdirectory has the gene annotation file carrolli_GCA_018152295.1.gff for the publicly available D. carrolli assembly version GCA_018152295.1.
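To get a quick overview of what an annotation file contains, one option is to count features by type. The sketch below assumes the .gff files follow the standard nine-column, tab-delimited GFF layout with the feature type in column 3; `count_feature_types` is a hypothetical helper name, and the feature type labels actually used in these files are not stated in this README.

```python
from collections import Counter


def count_feature_types(gff_path: str) -> Counter:
    """Count GFF records per feature type (column 3)."""
    counts: Counter = Counter()
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                # Some GFF3 files embed sequence after this directive.
                break
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 9:
                counts[fields[2]] += 1
    return counts


# Hypothetical usage, assuming Final_Versions.tar.gz has been extracted:
# counts = count_feature_types("Final_Versions/carrolli/carrolli_GCA_018152295.1.gff")
# for feature_type, n in counts.most_common():
#     print(feature_type, n)
```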
Intermediate Versions of Assembly and Annotation
The Dovetail/ directory has files, data, and reports provided by Dovetail Genomics in producing the HiRise assembly. The HiC_HiCRise_MLkUi/ subdirectory includes information on the HiC scaffolding process and includes the final Dovetail assembly drosophila_prolongata_29Dec2017_MLkUi.fasta.gz and the bam file for mapped HiC reads. The PBassembly/ subdirectory contains information and data from the PacBio sequencing and Falcon assembly process, including the p_ctg.fasta.gz primary assembly and a_ctg.fasta.gz associated haplotig scaffold sequences.
The gff_combine/ directory has the intermediate genome annotation files which were combined to produce the annotations available in Final_Versions/. They are split first by species (subdirectories carrolli/ and prolongata/), then by the program used to generate the annotation (subdirectories liftoff/ and maker/). These intermediate annotations were combined using a pipeline detailed in the annotation_tools github repo.
Assessing Assembly and Annotations
The busco/ directory has all results from standard BUSCO analyses; runs labeled "genome" are for assemblies, runs for "transcripts" are against transcript sequences extracted from the assembly using the relevant genome annotation. Included are analyses of focal species D. prolongata and D. carrolli, and of reference species D. melanogaster and D. rhopaloa. Results from D. prolongata after removal of duplicate scaffolds are labeled "dedup".
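For readers who want to recompute summary percentages from a BUSCO run, the following is a minimal sketch that tallies BUSCO statuses from a full_table.tsv file. It assumes the usual BUSCO layout (comment lines starting with `#`, then tab-delimited rows with the BUSCO ID in column 1 and the status in column 2, where a duplicated BUSCO appears on one row per copy). `busco_status_counts` is a hypothetical helper name.

```python
from collections import Counter


def busco_status_counts(full_table_path: str) -> Counter:
    """Count BUSCO IDs per status, counting each ID once."""
    status_by_id = {}
    with open(full_table_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                # Duplicated BUSCOs repeat on several rows; keep one entry per ID.
                status_by_id[fields[0]] = fields[1]
    return Counter(status_by_id.values())


# Hypothetical usage, assuming busco.tar.gz has been extracted:
# counts = busco_status_counts("busco/prolongata_genome_01/run_diptera_odb10/full_table.tsv")
# total = sum(counts.values())
# for status, n in counts.items():
#     print(status, n, round(100 * n / total, 1))
```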
The mummer/species_alignments/ subdirectory has primary results (delta files) from mummer alignments against reference species and tsv files used for alignment plotting, split into melanogaster/ and rhopaloa/ subdirectories. The value for the c parameter used (seed alignment length) is included in file names. Delta files were converted into tsv files and plotted in R with a pipeline detailed in the annotation_tools github repo.
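To show what a delta-to-tsv conversion can look like, here is a minimal sketch that extracts alignment coordinates from a delta file. It is not the annotation_tools pipeline and the column names in the output are chosen for this example only. It relies on the delta layout produced by nucmer: two header lines, then `>ref qry ref_len qry_len` records, each followed by seven-field alignment lines and single-integer indel offsets.

```python
import csv
from typing import List, Tuple

Row = Tuple[str, str, int, int, int, int]


def parse_delta(path: str) -> List[Row]:
    """Return (ref, qry, ref_start, ref_end, qry_start, qry_end) per alignment."""
    rows: List[Row] = []
    ref = qry = None
    with open(path) as handle:
        next(handle, None)  # line 1: paths of the two input FASTA files
        next(handle, None)  # line 2: alignment type, e.g. NUCMER
        for line in handle:
            parts = line.split()
            if not parts:
                continue
            if parts[0].startswith(">"):
                ref, qry = parts[0][1:], parts[1]
            elif len(parts) == 7 and ref is not None:
                r_start, r_end, q_start, q_end = map(int, parts[:4])
                rows.append((ref, qry, r_start, r_end, q_start, q_end))
            # Single-integer lines are indel offsets and are skipped here.
    return rows


def write_tsv(rows: List[Row], out_path: str) -> None:
    """Write alignment rows as a tab-delimited table with a header line."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["ref", "qry", "ref_start", "ref_end", "qry_start", "qry_end"])
        writer.writerows(rows)
```

In the resulting table, a row with qry_start greater than qry_end corresponds to an alignment on the reverse strand of the query, which is useful when drawing dot plots.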
Identification of Duplicate Scaffolds/Regions
The BUSCO-based pipeline to identify and remove duplicate scaffolds using the BUSCO results file busco/prolongata_genome_01/run_diptera_odb10/full_table.tsv is detailed in the annotation_tools github repo.
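The actual procedure lives in the annotation_tools repository and is not reproduced here. As a rough illustration of the general idea only, the sketch below counts how many duplicated BUSCOs each pair of scaffolds shares, which is one way candidate duplicate scaffolds could be surfaced for inspection. It assumes column 3 of full_table.tsv holds the scaffold name; `shared_duplicate_buscos` is a hypothetical helper, and no threshold or removal rule from the paper is implied.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_duplicate_buscos(full_table_path: str) -> Counter:
    """Count duplicated BUSCOs shared by each pair of distinct scaffolds."""
    scaffolds_by_busco = defaultdict(set)
    with open(full_table_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[1] == "Duplicated":
                scaffolds_by_busco[fields[0]].add(fields[2])

    pair_counts: Counter = Counter()
    for scaffolds in scaffolds_by_busco.values():
        # Copies on the same scaffold yield no pair and are ignored here.
        for pair in combinations(sorted(scaffolds), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical usage, assuming busco.tar.gz has been extracted:
# pairs = shared_duplicate_buscos("busco/prolongata_genome_01/run_diptera_odb10/full_table.tsv")
# for (scaffold_a, scaffold_b), n in pairs.most_common(10):
#     print(scaffold_a, scaffold_b, n)
```

Scaffold pairs sharing many duplicated BUSCOs would still need confirmation by sequence alignment before being treated as duplicates.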
The mummer/confirm_duplicates/ subdirectory has the delta results files for mummer alignments between removed duplicate scaffolds and retained scaffolds with corresponding duplicate sequence, along with tsv files used for alignment plotting as detailed in the annotation_tools github repo.
The blast/ directory has files used in the reciprocal BLAST-based tagging of candidate duplicate genes in the D. prolongata assembly, including results for all-against-all BLAST searches of gene region sequences from the D. prolongata and D. rhopaloa annotations, along with the gene region sequences from both species assemblies which were used as queries and references. The AllVsAll/ and AllVsRhop/ subdirectories have results for D. prolongata sequences queried against D. prolongata sequences and against D. rhopaloa sequences respectively. The pipeline to extract gene region sequences and process BLAST results is detailed in the annotation_tools github repo.
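The format of the BLAST result files is not stated in this README. Assuming they are in the standard 12-column tabular format (BLAST+ `-outfmt 6`, with query ID, subject ID, and bit score in columns 1, 2, and 12), a reciprocal best-hit check on all-against-all results could be sketched as follows. Both function names are hypothetical, and this is not the annotation_tools implementation.

```python
import csv
from typing import Dict, Set, Tuple


def best_non_self_hits(blast_tsv: str) -> Dict[str, Tuple[str, float]]:
    """Map each query to its highest-bit-score hit other than itself."""
    best: Dict[str, Tuple[str, float]] = {}
    with open(blast_tsv, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 12 or row[0].startswith("#"):
                continue
            query, subject, bitscore = row[0], row[1], float(row[11])
            if query == subject:
                continue  # ignore self-hits in all-against-all searches
            if query not in best or bitscore > best[query][1]:
                best[query] = (subject, bitscore)
    return best


def reciprocal_pairs(best: Dict[str, Tuple[str, float]]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs: Set[Tuple[str, str]] = set()
    for query, (subject, _) in best.items():
        if subject in best and best[subject][0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

Pairs returned by `reciprocal_pairs` would be candidates only; the criteria actually used to tag duplicate genes are described in the paper and repository.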
Sharing/Access information
Links to other publicly accessible locations of the assembly, and RNA-seq data used for MAKER annotation runs, can be found here:
The assembly was produced in collaboration with Dovetail genomics:
Code/Software
Custom scripts used for these analyses are available in this public GitHub repository: