Centromeres are essential chromosomal regions that mediate kinetochore assembly and spindle attachments during cell division. Despite their functional conservation, centromeres are amongst the most rapidly evolving genomic regions and can shape karyotype evolution and speciation across taxa. Although significant progress has been made in identifying centromere-associated proteins, the highly repetitive centromeres of metazoans have been refractory to DNA sequencing and assembly, leaving large gaps in our understanding of their functional organization and evolution. Here, we identify the sequence composition and organization of the centromeres of Drosophila melanogaster by combining long-read sequencing, chromatin immunoprecipitation for the centromeric histone CENP-A, and high-resolution chromatin fiber imaging. Contrary to previous models that heralded satellite repeats as the major functional components, we demonstrate that functional centromeres form on islands of complex DNA sequences enriched in retroelements that are flanked by large arrays of satellite repeats. Each centromere displays distinct size and arrangement of its DNA elements but is similar in composition overall. We discover that a specific retroelement, G2/Jockey-3, is the most highly enriched sequence in CENP-A chromatin and is the only element shared among all centromeres. G2/Jockey-3 is also associated with CENP-A in the sister species Drosophila simulans, revealing an unexpected conservation despite the reported turnover of centromeric satellite DNA. Our work reveals the DNA sequence identity of the active centromeres of a premier model organism and implicates retroelements as conserved features of centromeric DNA.
File S1. Custom repeat library
We created a custom Drosophila-specific consensus repeat library modified from RepBase v20150807 to include all complex satellite DNAs from Drosophila melanogaster.
File.S1.Chang_et_al.fasta
File S2. ChIPtigs from the R1 library
We created de novo contigs (ChIPtigs) from the R1 library ChIP-seq reads with SPAdes v3.11.0 (-t 24 --careful --sc).
File.S2.Chang_et_al.fasta
File S3. ChIPtigs from the R2 library
We subsampled reads from the R2 ChIP-seq library to 100x coverage using BBnorm (v37.54) with the parameters "threads=24 prefilter=t target=100", and created de novo contigs (ChIPtigs) from the subsampled reads with SPAdes v3.11.0 (-t 24 --careful --sc).
File.S3.Chang_et_al.fasta
File S4. ChIPtigs from the R3 library
We created de novo contigs (ChIPtigs) from the R3 library ChIP-seq reads with SPAdes v3.11.0 (-t 24 --careful --sc).
File.S4.Chang_et_al.fasta
File S5. ChIPtigs from the R4 library
We created de novo contigs (ChIPtigs) from the R4 library ChIP-seq reads with SPAdes v3.11.0 (-t 24 --careful --sc).
File.S5.Chang_et_al.fasta
File S6. ChIPtigs from the S2 library
We created de novo contigs (ChIPtigs) from the S2 library ChIP-seq reads with SPAdes v3.11.0 (-t 24 --careful --sc).
File.S6.Chang_et_al.fasta
File S7. Hybrid PacBio-Nanopore assembly
We assembled Nanopore (Solares et al. 2018) and PacBio (Kim et al. 2014) reads into a hybrid assembly using Canu v1.7 with default settings. The assembly size is 162,798,260 bp, with an N50 of 5,104,646 bp.
File.S7.Chang_et_al.fasta
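The N50 quoted for the hybrid assembly is the contig length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal Python sketch of that calculation, not the code used for this work; the function names `fasta_lengths` and `n50` are illustrative.

```python
def fasta_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Applied to an assembly, `n50(fasta_lengths(open("File.S7.Chang_et_al.fasta")))` gives a value that can be compared with the N50 reported above.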
File S8. PacBio-only assembly with extra contigs
We used the PacBio-only assembly from Chang and Larracuente 2018 and added 19 sequences with CENP-A-enriched repeats. Of these 19 sequences, 6 were contigs from the hybrid PacBio-Nanopore assembly (File S7) and the remaining 13 were error-corrected PacBio reads.
File.S8.Chang_et_al.fasta
File S9. Repeat annotation file for the finished PacBio-only assembly with extra contigs
We annotated the finished assembly with RepeatMasker 4.06 using our custom repeat library (File S1; parameters "-lib library.fasta -s").
File.S9.Chang_et_al.gff.txt
File S10. Gene annotation file for the PacBio-only assembly with extra contigs
We transferred gene annotations from FlyBase r6.20 to our genome using BLAT and CrossMap v0.2.5.
File.S10.Chang_et_al.gff
File S11. The sequences of Stellaris probes for Rsp
The following Stellaris probes are tagged with Quasar 570 and used to detect Rsp sequences.
File.S11.Chang_et_al.txt
File S12. The fasta alignment of genomic IGS sequences from D. melanogaster and outgroup species
We extracted all IGS elements from the genome using BLAST v2.7.1 with parameters “-task blastn -num_threads 24 -qcov_hsp_perc 90” and custom scripts. We then aligned and manually inspected IGS sequences using Geneious v8.1.6.
File.S12.Chang_et_al.fasta
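The -qcov_hsp_perc 90 setting keeps only hits in which a single HSP spans at least 90% of the query. The following Python sketch shows how that coverage could be recomputed from tabular BLAST output and turned into genomic intervals for extraction. It is an illustration, not the custom scripts used here, and it assumes the output was written with -outfmt "6 qseqid sseqid qstart qend qlen sstart send", a column layout that is not stated in the description above.

```python
def filter_hits(lines, min_qcov=90.0):
    """Keep BLAST hits whose HSP covers at least min_qcov percent of the query.

    Assumed (hypothetical) columns: qseqid sseqid qstart qend qlen sstart send.
    Returns (subject, start, end, strand, qcov) tuples with start <= end.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        sseqid = fields[1]
        qstart, qend, qlen, sstart, send = (int(x) for x in fields[2:7])
        # Per-HSP query coverage, the quantity thresholded by -qcov_hsp_perc.
        qcov = 100.0 * (abs(qend - qstart) + 1) / qlen
        if qcov < min_qcov:
            continue
        # blastn reports minus-strand hits with sstart > send.
        strand = "+" if sstart <= send else "-"
        kept.append((sseqid, min(sstart, send), max(sstart, send), strand, round(qcov, 1)))
    return kept
```

The returned intervals are 1-based and inclusive, matching BLAST coordinates, so they can be passed directly to a sequence-extraction step.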
File S13. A fasta alignment of G2/Jockey-3 sequences from different contigs and outgroup species
We extracted the G2/Jockey-3 sequences based on RepeatMasker annotations and custom scripts. We then aligned and manually inspected G2/Jockey-3 sequences using Geneious v8.1.6.
File.S13.Chang_et_al.fasta
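Extracting a repeat family from RepeatMasker annotations amounts to selecting the annotated intervals for that family and slicing them out of the assembly. The Python sketch below illustrates the idea; it is not the custom scripts used for File S13. It assumes a tab-delimited GFF in which the repeat name appears somewhere in the ninth column, and the `name` argument and the in-memory `genome` dictionary are hypothetical conveniences.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_repeat(gff_lines, genome, name):
    """Return (label, sequence) pairs for GFF features whose attributes mention name.

    genome maps contig IDs to sequence strings (assumed to fit in memory).
    """
    records = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or name not in cols[8]:
            continue
        contig, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # GFF coordinates are 1-based and inclusive.
        seq = genome[contig][start - 1:end]
        if strand == "-":
            seq = reverse_complement(seq)
        records.append((f"{contig}:{start}-{end}({strand})", seq))
    return records
```

Reverse-complementing minus-strand copies puts every extracted element in the same orientation, which simplifies the subsequent alignment.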
File S14. The newick consensus tree of IGS sequences inferred using RAxML
We constructed maximum likelihood phylogenetic trees for IGS using RAxML v8.2.11 with parameters “-m GTRGAMMA -T 24 -d -p 12345 -# autoMRE -k -x 12345 -f a”.
File.S14.Chang_et_al.nwk
File S15. The newick consensus tree of G2/Jockey-3 sequences inferred using RAxML
We constructed maximum likelihood phylogenetic trees for G2/Jockey-3 using RAxML v8.2.11 with parameters “-m GTRGAMMA -T 24 -d -p 12345 -# autoMRE -k -x 12345 -f a”.
File.S15.Chang_et_al.nwk
File S16. Oligopaint coordinates and sequences
Oligopaint sequences and information for centromeres X, 3, 4, and Y. The columns give the centromere contig ID, the start and end coordinates of the sequence, the oligo sequence, and the melting temperature (all.oligos.cen.islands). Also included are the same Oligopaint sequences with 5’ and 3’ extensions containing the universal primer followed by library-specific barcodes (oligos.with.adaptors).
File.S16.Chang_et_al.xlsx