Whole genome sequences of 23 species from the Drosophila montium species group (Diptera: Drosophilidae): a resource for testing evolutionary hypotheses
Data files
Oct 13, 2019 version files (1.45 GB total)
montium_Dryad.tar.gz, 1.45 GB
Feb 18, 2020 version files (1.61 GB total)
montium_Dryad_2.tar.gz, 160.67 MB
montium_Dryad.tar.gz, 1.45 GB
Abstract
Large groups of species with well-defined phylogenies are excellent systems for testing evolutionary hypotheses. In this paper, we describe the creation of a comparative genomic resource consisting of 23 genomes from the species-rich Drosophila montium species group, 22 of which are presented here for the first time. The montium group is uniquely positioned for comparative studies. Within the montium clade, evolutionary distances are such that large numbers of sequences can be accurately aligned while also recovering strong signals of divergence; and the distance between the montium group and D. melanogaster is short enough that orthologous sequences can be readily identified. All genomes were assembled from a single, small-insert library using MaSuRCA, before going through an extensive post-assembly pipeline. Estimated genome sizes within the montium group range from 155 Mb to 223 Mb (mean=196 Mb). The absence of long-distance information during the assembly process resulted in fragmented assemblies, with the scaffold NG50s varying widely based on repeat content and sample heterozygosity (min=18 kb, max=390 kb, mean=74 kb). The total scaffold length for most assemblies is also shorter than the estimated genome size, typically by 5-15%. However, subsequent analysis showed that our assemblies are highly complete. Despite large differences in contiguity, all assemblies contain at least 96% of known single-copy Dipteran genes (BUSCOs, n=2,799). Similarly, by aligning our assemblies to the D. melanogaster genome and remapping coordinates for a large set of transcriptional enhancers (n=3,457), we showed that each montium assembly contains orthologs for at least 91% of D. melanogaster enhancers. Importantly, the genic and enhancer contents of our assemblies are comparable to those of far more contiguous Drosophila assemblies. The alignment of our own D. serrata assembly to a previously published PacBio D. serrata assembly also showed that our longest scaffolds (up to 1 Mb) are free of large-scale misassemblies. Our genome assemblies are a valuable resource that can be used to further resolve the montium group phylogeny; study the evolution of protein-coding genes and cis-regulatory sequences; and determine the genetic basis of ecological and behavioral adaptations.
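For readers unfamiliar with the scaffold NG50 statistic quoted in the abstract, the following is a minimal sketch of how it is commonly computed. It is not taken from the authors' pipeline. The function name `ng50` and the toy scaffold lengths are hypothetical. Unlike N50, which is measured against the total assembly length, NG50 is measured against the estimated genome size, so it stays comparable between assemblies whose total scaffold lengths fall short of the genome size by different amounts.

```python
def ng50(scaffold_lengths, genome_size):
    """Return the scaffold NG50.

    NG50 is the length L such that scaffolds of length >= L together
    cover at least half of the *estimated genome size* (not half of the
    assembly length, which would give N50 instead).
    """
    half = genome_size / 2
    running_total = 0
    # Walk through the scaffolds from longest to shortest.
    for length in sorted(scaffold_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    # The assembly covers less than half of the estimated genome size.
    return 0


# Toy example with made-up numbers (not from the montium assemblies).
toy_scaffolds = [500, 300, 100, 100]       # total assembly length = 1000
print(ng50(toy_scaffolds, 1000))           # genome size = assembly length, i.e. N50
print(ng50(toy_scaffolds, 1200))           # larger estimated genome size
```

With these toy numbers, the statistic drops from 500 to 300 once the estimated genome size exceeds the assembly length, which is why NG50 is the more conservative measure for assemblies that are shorter than the genome.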
This data archive contains repeat-masked assemblies, RepeatMasker annotation and summary tables, and liftOver chain files for 23 montium genomes. It also includes BUSCO assessment tables for D. bocki, D. burlai, D. jambulina, D. kanapiae, D. mayri, D. melanogaster, D. pectinifera, D. rufa, and D. triauraria. The file Species_Names.tab is a key for the species abbreviations used in the filenames. Repeats were soft-masked using RepeatMasker. The liftOver chain files were created by individually aligning each montium assembly to the D. melanogaster genome using a previously described whole genome alignment pipeline. Detailed methods are described in a forthcoming paper. Given a set of coordinates for annotated features in the D. melanogaster genome (in either BED or GFF format) and a liftOver chain file, the liftOver utility returns coordinates for (putatively) orthologous sequences in the corresponding aligned montium genome. The liftOver utility can be downloaded by following the link to the utilities directory on this webpage: http://hgdownload.soe.ucsc.edu/downloads.html. The webpage also contains instructions for making the utility executable.
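The liftOver step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' workflow. The helper names `build_liftover_command` and `run_liftover` are hypothetical, as are all file names in the usage comment; the actual chain file names in the archive follow the abbreviations in Species_Names.tab. The sketch assumes the `liftOver` executable is on the PATH and uses its standard argument order (input file, chain file, output file, unmapped file), with the `-gff` option for GFF input and the optional `-minMatch` option to relax the default match threshold.

```python
import subprocess


def build_liftover_command(features, chain, mapped, unmapped,
                           gff=False, min_match=None):
    """Assemble the liftOver command line as a list of arguments.

    features : BED or GFF file with D. melanogaster coordinates
    chain    : liftOver chain file (D. melanogaster -> montium assembly)
    mapped   : output file for successfully remapped features
    unmapped : output file for features that could not be remapped
    """
    cmd = ["liftOver"]
    if gff:
        cmd.append("-gff")                       # input is GFF rather than BED
    if min_match is not None:
        cmd.append(f"-minMatch={min_match}")     # minimum ratio of bases that must remap
    cmd += [features, chain, mapped, unmapped]
    return cmd


def run_liftover(*args, **kwargs):
    """Run liftOver; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_liftover_command(*args, **kwargs), check=True)


# Hypothetical usage (file names are placeholders, not files in this archive):
# run_liftover("dmel_enhancers.bed", "dmel_to_montium.chain",
#              "montium_enhancers.bed", "unmapped.bed")
```

Features that cannot be remapped are written to the unmapped file rather than silently dropped, so comparing the line counts of the two output files gives a quick estimate of how many D. melanogaster features have a putative ortholog in a given assembly.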
The un-masked assemblies and raw sequencing data are available separately through the Drosophila montium Species Group Genomes Project, NCBI BioProject Accession PRJNA554346.
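Soft-masking marks repeats by writing them in lowercase while leaving the sequence itself intact, so the repeat content of a masked assembly can be checked directly from the FASTA file. The following is a minimal sketch; the function name `soft_masked_fraction` and the example records are hypothetical, and the result is only expected to approximate the figures in the RepeatMasker summary tables.

```python
def soft_masked_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lowercase.

    fasta_lines can be any iterable of lines, for example an open file
    handle for a soft-masked FASTA assembly. Header lines (starting
    with '>') are skipped. Lowercase 'n' is counted as masked.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue                      # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy example with made-up sequence (not from the montium assemblies).
example = [">scaffold_1", "ACGTacgt", ">scaffold_2", "NNnn"]
print(soft_masked_fraction(example))

# Hypothetical usage on a file extracted from the archive:
# with open("assembly.masked.fasta") as handle:
#     print(soft_masked_fraction(handle))
```

Tools that ignore case treat a soft-masked assembly like an unmasked one, while repeat-aware aligners can use the lowercase regions to skip seeding in repeats.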