
Transposable element accumulation drives size differences among polymorphic Y chromosomes in Drosophila

Cite this dataset

Nguyen, Alison (2022). Transposable element accumulation drives size differences among polymorphic Y chromosomes in Drosophila [Dataset]. Dryad.


Y chromosomes of many species are gene poor and show low levels of nucleotide variation, yet often display high amounts of structural diversity. Dobzhansky cataloged several morphologically distinct Y chromosomes in Drosophila pseudoobscura that differ in size and shape, but the molecular causes of their dramatic size differences are unclear. Here we use cytogenetics and long-read sequencing to study the sequence content of polymorphic Y chromosomes in D. pseudoobscura. We show that Y chromosomes differ by almost 2-fold in size, ranging from 30 to 60 Mb. Most of this size difference is caused by a handful of active transposable elements (TEs) that have recently expanded on the largest Y chromosome, with different elements being responsible for Y expansion on differently sized D. pseudoobscura Y’s. We show that Y chromosomes differ in their heterochromatin enrichment, expression of Y-enriched TEs, and also influence expression of dozens of autosomal and X-linked genes. Intriguingly, the same helitron element that showed the most drastic amplification on the largest Y in D. pseudoobscura independently amplified on a polymorphic large Y chromosome in D. affinis, suggesting that some TEs are inherently more prone to become deregulated on Y chromosomes.


This data set contains the key output files from the Bionano hybrid assembly workflow used in "Transposable element accumulation drives size differences among polymorphic Y chromosomes in Drosophila." The data is from a Drosophila pseudoobscura male that has a large Y chromosome.

All data was generated from high molecular weight gDNA derived from whole-body tissue of a D. pseudoobscura male. The DLS labeling kit was used to tag the DNA, and the labeled DNA was run on the Saphyr platform. The Bionano molecule data was combined with an NGS-based male D. pseudoobscura genome assembly to generate a hybrid assembly. The dataset includes: the hybrid assembly sequences (fasta files), a record of how NGS contigs were stitched together with Bionano data and the adjustments made (.agp, .gap), the log file (.log), and the general summary statistics of the pipeline.

Usage notes

Fasta sequences: 

  • EXP_REFINEFINAL1_bppAdjust_cmap_LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3_fasta_NGScontigs_HYBRID_SCAFFOLD.fasta: NGS sequences that were stitched together using information from the Bionano data. Note that these scaffolds will contain gaps where there was no sequence information from the NGS data.
  • EXP_REFINEFINAL1_bppAdjust_cmap_LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3_fasta_NGScontigs_HYBRID_SCAFFOLD_NOT_SCAFFOLDED.fasta: NGS sequences that were not stitched together or were cut out during the Bionano hybrid assembly pipeline.
  • LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3.fasta.cut.fasta: the original NGS sequences that were used for the hybrid assembly pipeline, after being cut so that the resulting pieces could be stitched together. This includes sequences from the autosomes, X chromosome, and Y chromosome.
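
Because the hybrid scaffolds contain gaps wherever the NGS data provided no sequence, it can be useful to locate those gaps before downstream analysis. The following is a minimal sketch in Python (standard library only), assuming that gaps are written as runs of N characters, which is the usual convention but is not stated in the files described here. The helper names read_fasta and gap_runs and the toy record are illustrative and not part of the dataset.

```python
import io
import re


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_runs(sequence):
    """Return 0-based, half-open (start, end) intervals of N runs.

    Assumption: gaps are encoded as consecutive N (or n) characters.
    """
    return [match.span() for match in re.finditer(r"[Nn]+", sequence)]


# Toy record standing in for one scaffold (hypothetical name and sequence).
demo = io.StringIO(">scaffold_demo\nACGTNNNNAC\nGTNNACGT\n")
for name, seq in read_fasta(demo):
    runs = gap_runs(seq)
    gap_bases = sum(end - start for start, end in runs)
    print(name, len(seq), runs, gap_bases)
```

To apply it to a real file, replace the io.StringIO object with an open file handle, for example open("scaffolds.fasta"), where the path is a placeholder for the HYBRID_SCAFFOLD.fasta file listed above.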

Assembly stitching information:

  • EXP_REFINEFINAL1_bppAdjust_cmap_LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3_fasta_NGScontigs_HYBRID_SCAFFOLD.agp: details how each hybrid scaffold was built from the NGS assembly contigs.
    • Note: the sequences named in the first column can be found in EXP_REFINEFINAL1_bppAdjust_cmap_LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3_fasta_NGScontigs_HYBRID_SCAFFOLD.fasta (with the start and end coordinates recorded in columns 2 and 3), and the sequences named in the 6th column can be found in LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3.fasta.cut.fasta (with the start and end coordinates in columns 7 and 8). The strand orientation is in column 9.
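
The column layout described in the note can be read with a few lines of Python. This is a sketch under stated assumptions, not the pipeline's own code: it assumes tab-separated fields, comment lines starting with "#", gap rows marked by N or U in column 5, and 1-based inclusive coordinates, as in the standard AGP format. The class AgpComponent, the functions parse_agp and component_sequence, and all demo names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class AgpComponent(NamedTuple):
    scaffold: str        # column 1: hybrid scaffold name
    scaffold_start: int  # column 2
    scaffold_end: int    # column 3
    contig: str          # column 6: name in the .cut.fasta file
    contig_start: int    # column 7
    contig_end: int      # column 8
    orientation: str     # column 9: "+" or "-"


def parse_agp(lines: Iterable[str]) -> Iterator[AgpComponent]:
    """Yield one AgpComponent per sequence row, skipping comments and gaps."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[4] in ("N", "U"):
            continue  # assumption: N/U in column 5 marks a gap row
        yield AgpComponent(fields[0], int(fields[1]), int(fields[2]),
                           fields[5], int(fields[6]), int(fields[7]),
                           fields[8])


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def component_sequence(comp: AgpComponent, contigs: Dict[str, str]) -> str:
    """Slice the contig (1-based inclusive) and reverse-complement if on '-'."""
    piece = contigs[comp.contig][comp.contig_start - 1:comp.contig_end]
    if comp.orientation == "-":
        piece = piece.translate(_COMPLEMENT)[::-1]
    return piece


# Hypothetical rows and contigs, for illustration only.
demo_rows = [
    "# hypothetical header line",
    "Super-Scaffold_1\t1\t8\t1\tW\tcontig_A\t1\t8\t+",
    "Super-Scaffold_1\t9\t12\t2\tN\t4\tscaffold\tyes\tmap",
    "Super-Scaffold_1\t13\t16\t3\tW\tcontig_B\t3\t6\t-",
]
demo_contigs = {"contig_A": "ACGTACGT", "contig_B": "TTAACCGG"}
for comp in parse_agp(demo_rows):
    print(comp.scaffold, comp.contig, component_sequence(comp, demo_contigs))
```

Gap rows are skipped because their columns 6 to 9 carry gap metadata rather than contig coordinates in the standard AGP layout.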

Pipeline overview:

  • EXP_REFINEFINAL1_bppAdjust_cmap_LarY_CFFC_QM2_Ys_UnstitchedAutoX_v3_fasta_HYBRID_SCAFFOLD_log.txt: contains the entire log from the workflow and also includes the summary statistics for several analyses. It is included for completeness and to track the workflow version that was used.
  • hybrid_scaffold_informatics_report.txt: contains the overall summary statistics of the hybrid scaffold assembly compared to the NGS-only and Bionano-only assemblies. It includes genome assembly sizes and general genome assembly statistics (N50, max and min contig lengths).
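
For readers who want to recompute statistics of the kind listed in the report from a set of sequence lengths, the sketch below uses the common definition of N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. Whether the Bionano report uses exactly this definition is an assumption, and the function name n50 and the example lengths are illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches at least half of the total; the length of the last
    sequence added is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("at least one sequence of positive length is required")
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a positive total")


# Hypothetical lengths, for illustration only.
example_lengths = [2, 3, 4, 5, 6]
print("total:", sum(example_lengths))
print("max:", max(example_lengths), "min:", min(example_lengths))
print("N50:", n50(example_lengths))
```

Comparing integers with 2 * running >= total avoids floating-point division when testing for the halfway point.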