Chang, Ching-Ho1; Chanvan, Ankita2; Palladino, Jason2; Wei, Xiaolu1; Martins, Nuno M. C.3; Santinello, Bryce2; Chen, Chin-Chi2; Erceg, Jelena3; Beliveau, Brian J.3; Wu, Chao-Ting3; Larracuente, Amanda M.1; Mellone, Barbara G2

Published May 16, 2019 on Dryad. https://doi.org/10.5061/dryad.rb1bt3j

Abstract

Centromeres are essential chromosomal regions that mediate kinetochore assembly and spindle attachments during cell division. Despite their functional conservation, centromeres are amongst the most rapidly evolving genomic regions and can shape karyotype evolution and speciation across taxa. Although significant progress has been made in identifying centromere-associated proteins, the highly repetitive centromeres of metazoans have been refractory to DNA sequencing and assembly, leaving large gaps in our understanding of their functional organization and evolution. Here, we identify the sequence composition and organization of the centromeres of Drosophila melanogaster by combining long-read sequencing, chromatin immunoprecipitation for the centromeric histone CENP-A, and high-resolution chromatin fiber imaging. Contrary to previous models that heralded satellite repeats as the major functional components, we demonstrate that functional centromeres form on islands of complex DNA sequences enriched in retroelements that are flanked by large arrays of satellite repeats. Each centromere displays distinct size and arrangement of its DNA elements but is similar in composition overall. We discover that a specific retroelement, G2/Jockey-3, is the most highly enriched sequence in CENP-A chromatin and is the only element shared among all centromeres. G2/Jockey-3 is also associated with CENP-A in the sister species Drosophila simulans, revealing an unexpected conservation despite the reported turnover of centromeric satellite DNA. Our work reveals the DNA sequence identity of the active centromeres of a premier model organism and implicates retroelements as conserved features of centromeric DNA.

File S1. Custom repeat library

We created a custom Drosophila-specific consensus repeat library by modifying Repbase v20150807 to include all complex satellite DNAs from Drosophila melanogaster.

File.S1.Chang_et_al.fasta

File S2. ChIPtigs from the R1 library

We assembled de novo contigs (ChIPtigs) from the ChIP-seq reads of the R1 library with SPAdes v3.11.0 (-t 24 --careful --sc).

File.S2.Chang_et_al.fasta

File S3. ChIPtigs from the R2 library

We subsampled the R2 ChIP-seq reads to 100x coverage using BBnorm (v37.54) with the parameters "threads=24 prefilter=t target=100", and assembled de novo contigs (ChIPtigs) from the subsampled reads with SPAdes v3.11.0 (-t 24 --careful --sc).

File.S3.Chang_et_al.fasta
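The two steps described for the R2 library can be sketched as argument lists, for example to pass to subprocess.run. This is a minimal illustration and not the authors' script: the read and output file names are hypothetical, the reads are assumed to be in a single FASTQ file (hence -s for SPAdes), and only the parameters quoted above come from the dataset description.

```python
from typing import List


def bbnorm_command(reads_in: str, reads_out: str) -> List[str]:
    """Argument list for normalizing reads to ~100x with BBnorm.

    The in=/out= file names are placeholders; threads, prefilter and
    target are the parameters quoted in the description of File S3.
    """
    return [
        "bbnorm.sh",
        f"in={reads_in}",
        f"out={reads_out}",
        "threads=24",
        "prefilter=t",
        "target=100",
    ]


def spades_command(reads: str, out_dir: str) -> List[str]:
    """Argument list for a SPAdes run with -t 24 --careful --sc.

    Assumption: single-end reads supplied with -s; paired-end data
    would need -1/-2 instead.
    """
    return ["spades.py", "-t", "24", "--careful", "--sc", "-s", reads, "-o", out_dir]


# Hypothetical file names, for illustration only.
r2_steps = [
    bbnorm_command("R2.fastq", "R2.subsampled.fastq"),
    spades_command("R2.subsampled.fastq", "R2_chiptigs"),
]

for step in r2_steps:
    print(" ".join(step))
```

For the R1, R3, R4 and S2 libraries, which were not subsampled, only the SPAdes step would apply.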

File S4. ChIPtigs from the R3 library

We assembled de novo contigs (ChIPtigs) from the ChIP-seq reads of the R3 library with SPAdes v3.11.0 (-t 24 --careful --sc).

File.S4.Chang_et_al.fasta

File S5. ChIPtigs from the R4 library

We assembled de novo contigs (ChIPtigs) from the ChIP-seq reads of the R4 library with SPAdes v3.11.0 (-t 24 --careful --sc).

File.S5.Chang_et_al.fasta

File S6. ChIPtigs from the S2 library

We assembled de novo contigs (ChIPtigs) from the ChIP-seq reads of the S2 library with SPAdes v3.11.0 (-t 24 --careful --sc).

File.S6.Chang_et_al.fasta

File S7. Hybrid PacBio-Nanopore assembly

We assembled Nanopore (Solares et al. 2018) and PacBio (Kim et al. 2014) reads into a hybrid assembly using Canu v1.7 with default settings. The assembly is 162,798,260 bp in total, with an N50 of 5,104,646 bp.

File.S7.Chang_et_al.fasta
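For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is commonly computed from a list of contig lengths: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The toy lengths are invented for illustration and are not taken from File S7.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length
    of the contig at which the cumulative sum first reaches at least
    half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy example (invented lengths): total = 150, half = 75,
# cumulative sums are 50, 90, ... so the N50 is 40.
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))  # prints 40
```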

File S8. PacBio-only assembly with extra contigs

We used the PacBio-only assembly from Chang and Larracuente (2018) and added 19 sequences containing CENP-A-enriched repeats. Of these 19 sequences, 6 are contigs from the hybrid PacBio-Nanopore assembly (File S7) and the remaining 13 are error-corrected PacBio reads.

File.S8.Chang_et_al.fasta

File S9. Repeat annotation file for the finished PacBio-only assembly with extra contigs

We annotated the finished assembly with RepeatMasker 4.06 using our custom repeat library (options: -lib library.fasta -s).

File.S9.Chang_et_al.gff.txt
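As a starting point for working with the annotation, the sketch below sums the annotated base pairs per contig from GFF-style lines. It assumes only the standard tab-separated GFF layout (sequence ID in column 1, 1-based inclusive start and end in columns 4 and 5); the contig names and attribute strings in the example are invented, and overlapping annotations would be counted more than once.

```python
from collections import defaultdict
from typing import Dict, Iterable


def annotated_bp_per_contig(gff_lines: Iterable[str]) -> Dict[str, int]:
    """Sum feature lengths per sequence ID from GFF-style lines.

    Comment lines (starting with '#') and blank lines are skipped.
    Coordinates are treated as 1-based and inclusive, so a feature
    spanning start..end contributes end - start + 1 bp.
    """
    totals: Dict[str, int] = defaultdict(int)
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        start, end = int(fields[3]), int(fields[4])
        totals[fields[0]] += end - start + 1
    return dict(totals)


# Invented example lines; real records in File S9 may differ in detail.
example = [
    "##gff-version 2",
    'contig_A\tRepeatMasker\tsimilarity\t1\t100\t12.5\t+\t.\tTarget "Motif:example_repeat" 1 100',
    'contig_A\tRepeatMasker\tsimilarity\t201\t250\t8.0\t-\t.\tTarget "Motif:example_repeat" 5 54',
    'contig_B\tRepeatMasker\tsimilarity\t5\t14\t3.1\t+\t.\tTarget "Motif:other_repeat" 1 10',
]
totals = annotated_bp_per_contig(example)
print(totals)
```

To run this on the real file, pass an open file handle instead of the example list.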

File S10. Gene annotation file for the PacBio-only assembly with extra contigs

We transferred gene annotations from FlyBase r6.20 to our genome using BLAT and CrossMap v0.2.5.

File.S10.Chang_et_al.gff

File S11. The sequences of Stellaris probes for Rsp

The following Stellaris probes are tagged with Quasar 570 and used to detect Rsp sequences.

File.S11.Chang_et_al.txt

File S12. The fasta alignment of genomic IGS sequences from D. melanogaster and outgroup species

We extracted all IGS elements from the genome using BLAST v2.7.1 with the parameters “-task blastn -num_threads 24 -qcov_hsp_perc 90” and custom scripts. We then aligned and manually inspected the IGS sequences using Geneious v8.1.6.

File.S12.Chang_et_al.fasta
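The -qcov_hsp_perc 90 setting keeps only hits in which a single HSP covers at least 90% of the query. The sketch below illustrates that criterion for one hit, assuming 1-based inclusive query coordinates as in BLAST tabular output; the query length and coordinates are invented, and BLAST's own rounding may differ slightly.

```python
def hsp_query_coverage(qstart: int, qend: int, qlen: int) -> float:
    """Percent of the query covered by one HSP (1-based, inclusive coordinates)."""
    if qlen <= 0:
        raise ValueError("query length must be positive")
    lo, hi = min(qstart, qend), max(qstart, qend)
    return 100.0 * (hi - lo + 1) / qlen


def passes_qcov_hsp(qstart: int, qend: int, qlen: int, min_perc: float = 90.0) -> bool:
    """True if the HSP covers at least min_perc percent of the query."""
    return hsp_query_coverage(qstart, qend, qlen) >= min_perc


# Invented example: a 240 bp query.
print(passes_qcov_hsp(1, 230, 240))   # 230/240 of the query is covered
print(passes_qcov_hsp(11, 200, 240))  # 190/240 of the query is covered
```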

File S13. A fasta alignment of G2/Jockey-3 sequences from different contigs and outgroup species

We extracted the G2/Jockey-3 sequences using the RepeatMasker annotations and custom scripts. We then aligned and manually inspected the G2/Jockey-3 sequences using Geneious v8.1.6.

File.S13.Chang_et_al.fasta
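Files S12 and S13 are alignments in FASTA format, so every record is expected to have the same length once gap characters are included. The following is a minimal sketch, using only the standard library, of reading such a file and checking that property; the sequence names and sequences in the example are invented.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a dict of header -> sequence.

    Sequence lines belonging to one record are concatenated; blank
    lines are ignored. Headers are kept whole, minus the leading '>'.
    """
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_aligned(records: Dict[str, str]) -> bool:
    """True if all sequences (including gaps) share one length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Invented toy alignment, wrapped over two lines per record.
toy = [
    ">seq_one",
    "ACGT-ACG",
    "TTGA",
    ">seq_two",
    "ACGTTACG",
    "T-GA",
]
alignment = read_fasta(toy)
print(len(alignment), is_aligned(alignment))
```

With the real files, the list of lines can be replaced by an open file handle.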

File S14. The newick consensus tree of IGS sequences inferred using RAxML

We constructed maximum likelihood phylogenetic trees for IGS using RAxML v.8.2.11 with parameters “-m GTRGAMMA -T24 -d -p 12345 -# autoMRE -k -x 12345 -f a”.

File.S14.Chang_et_al.nwk

File S15. The newick consensus tree of G2/Jockey-3 sequences inferred using RAxML

We constructed maximum likelihood phylogenetic trees for G2/Jockey-3 using RAxML v.8.2.11 with parameters “-m GTRGAMMA -T24 -d -p 12345 -# autoMRE -k -x 12345 -f a”.

File.S15.Chang_et_al.nwk

File S16. Oligopaint coordinates and sequences

Oligopaint sequences and associated information for centromeres X, 3, 4, and Y. The columns give the centromere contig ID, the start and end coordinates of the sequence, the oligo sequence, and the melting temperature (all.oligos.cen.islands). Also included are the same Oligopaint sequences with 5’ and 3’ extensions containing the universal primer followed by library-specific barcodes (oligos.with.adaptors).

File.S16.Chang_et_al.xlsx

Data from: Islands of retroelements are major components of Drosophila centromeres
