Dryad

Progressive Cactus alignment of 298 drosophilid species

Data files

Dec 01, 2023 version files 68.23 GB


Abstract

Long-read sequencing is driving rapid progress in genome assembly across all major groups of life, including species of the family Drosophilidae, a longtime model system for genetics, genomics, and evolution. Whole-genome sequence alignments relate sequences at the nucleotide level across species, and they are a critical but computationally intensive step for downstream genomic analyses. Progressive Cactus is a reference-free, whole-genome alignment tool designed to scale to alignments of thousands of species.

In the study associated with this dataset, we conducted Oxford Nanopore long-read sequencing of both inbred lines and single wild flies obtained either directly from the field or from ethanol-preserved specimens in museum collections. We selected a set of 298 suitably high-quality drosophilid genomes from three sources: genomes assembled in this study, publicly available genomes that we assembled previously, and genomes assembled by other studies. Repeats were identified and soft-masked in each genome with RepeatModeler2 and RepeatMasker. A guide tree was constructed from 1,000 single-copy orthologs annotated by BUSCO v5 in all genomes. Individual gene trees were inferred with IQ-TREE 2, and a species tree was estimated from the gene trees with ASTRAL-MP. The species tree was then scaled by the substitution rate at 4-fold degenerate sites and provided to Progressive Cactus as the guide tree for the alignment. Detailed methods are provided in the study.
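For readers unfamiliar with 4-fold degenerate sites, the following is a minimal Python sketch of the idea, not the procedure used in the study (see the study for that). It assumes the standard genetic code, in-frame coding sequences, and two gap-free aligned sequences of equal length. The function names `fourfold_site_indices` and `fourfold_divergence` are hypothetical, and the divergence shown is a simple uncorrected proportion of differences rather than a model-based substitution rate.

```python
# Codon prefixes (first two bases) whose third position is 4-fold degenerate
# under the standard genetic code: any third base gives the same amino acid.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_site_indices(cds: str) -> list[int]:
    """Return 0-based indices of 4-fold degenerate third-codon positions."""
    cds = cds.upper()  # soft-masked (lowercase) bases are treated like any other
    sites = []
    # Walk complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in "ACGT":
            sites.append(start + 2)
    return sites


def fourfold_divergence(cds_a: str, cds_b: str) -> float:
    """Uncorrected proportion of differing bases at shared 4-fold sites.

    Assumption for this sketch: a site is compared only if it is 4-fold
    degenerate in both sequences and both codons share the same prefix.
    """
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must be aligned and of equal length")
    a, b = cds_a.upper(), cds_b.upper()
    sites_b = set(fourfold_site_indices(b))
    shared = [
        i for i in fourfold_site_indices(a)
        if i in sites_b and a[i - 2:i] == b[i - 2:i]
    ]
    if not shared:
        raise ValueError("no shared 4-fold degenerate sites")
    differences = sum(1 for i in shared if a[i] != b[i])
    return differences / len(shared)
```

Because every change at such a site is synonymous, these positions are commonly treated as evolving close to neutrally, which is why they are a natural choice for setting branch lengths on a guide tree.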

The alignment is released as an open resource and as a tool for studying evolution at the scale of an entire insect family.