Turnover of retroelements and satellite DNA drives centromere reorganization over short evolutionary timescales in Drosophila

Courret, Cecile1 ; Hemmer, Lucas W.1 ; Wei, Xiaolu1 ; Patel, Prachi D.2 ; Chabot, Bryce J.2 ; Fuda, Nicholas J.1 ; Geng, Xuewen1 ; Chang, Ching-Ho 3 ; Mellone, Barbara2 ; Larracuente, Amanda M.1

Published Nov 22, 2024 on Dryad. https://doi.org/10.5061/dryad.1zcrjdg2g

Data files

Nov 22, 2024 version files 465.27 MB

Coverage.zip

464.73 MB
Macs2_Peaks.zip

499.53 KB
Phylogeny.zip

20.23 KB
README.md

12.50 KB

Abstract

Centromeres reside in rapidly evolving, repeat-rich genomic regions, despite their essential function in chromosome segregation. Across organisms, centromeres are rich in selfish genetic elements such as transposable elements and satellite DNAs that can bias their transmission through meiosis. However, these elements still need to cooperate at some level and contribute to, or avoid interfering with, centromere function. To gain insight into the balance between conflict and cooperation at centromeric DNA, we take advantage of the close evolutionary relationships within the Drosophila simulans clade – D. simulans, D. sechellia, and D. mauritiana – and their relative, D. melanogaster. Using chromatin profiling combined with high resolution fluorescence in situ hybridization on stretched DNA, we characterize all centromeres across these species. We discovered dramatic centromere reorganization involving recurrent shifts between retroelements and satellite DNAs over short evolutionary timescales. We also reveal the recent origin (<240 Kya) of telocentric chromosomes in D. sechellia, where the X and 4^th centromeres now sit on telomere-specific retroelements. Finally, the Y chromosome centromeres, which are the only chromosomes that do not experience female meiosis, do not show dynamic cycling between satDNA and TEs. The patterns of rapid centromere turnover in these species are consistent with genetic conflicts in the female germline. and have implications for centromeric DNA function and karyotype evolution. Regardless of the evolutionary forces driving this turnover, the rapid reorganization of centromeric sequences over short evolutionary timescales highlights their potential as hotspots for evolutionary innovation.

https://doi.org/10.5061/dryad.1zcrjdg2g

Description of the data and file structure

Overview:

This repository contains data and code for analysis for Courret et al. 2024 (https://doi.org/10.1101/2023.08.22.554357). In this manuscript, we study centromere evolution in the Drosophila simulans clade – D. simulans, D. sechellia, and D. mauritiana – compared to their relative, D. melanogaster. We use chromatin profiling (CUT&Tag) combined with high resolution fluorescence in situ hybridization on stretched DNA, and evolutionary analyses to characterize and compare all centromeres across these species.

CUT&Tag:

We performed CUT&Tag using around 100,000 nuclei per sample. We used the pA-Tn5 enzyme from Epicypher and followed the manufacturer's protocol (CUT&Tag Protocol v1.5). For each species we performed 3 replicates with the anti-CID20 antibody (1:50), one positive control using anti-H3K9me3 (1:100), and one negative control using the anti-IgG antibody (1:100).

For the library preparation, we used the primers in S8 Table of Courret et al. 2024. We analyzed each library on Bioanalyzer for quality control, representative profiles of CENP-A and H3K27me3 profiles are provided in S11B Fig. Before final sequencing, we pooled 2µl of each library and performed a MiSeq run. We used the number of resulting reads from each library to estimate the relative concentration of each library and ensure an equal representation of each library in the final pool for sequencing. We sequenced the libraries in 150-bp paired-end mode on HiSeq Illumina. We obtained around 10 million reads per library, except for the IgG negative control, which usually has a lower representation (S9 Table).

Briefly, we trimmed paired-end reads using trimgalore (v0.4.4) and assessed read quality with FASTQC. We mapped reads against the a heterochromatin-enriched genome assembly (Chang and al 2022) with bwa (v7.4) using the BWA-MEM algorithm (default parameters). We converted the resulting sam alignment files into bam files and sorted using respectively samtools (v1.11) view and sort command. We removed PCR duplicates using Markduplicates from Picardtools (v2.12.0) (https://broadinstitute.github.io/picard/). Because we are working with highly repetitive sequences, we analyzed both the unique and multi-mapping reads. We thus performed two different filtering based on mapping quality using samtools view [84]. To include multi-mapping reads, we use the following parameters: -b -h -f 3 -F 4 -F 8 -F 256 -F 2048. To keep only the uniquely mapping reads we use the following parameters: -b -h -f 3 -F 4 -F 8 -F 256 -F 2048 -q30. We estimated read coverage using the bamCoverage command from deeptools (v3.5.1) and normalized the read coverage to RPM (reads per million). The mappability score was calculated using GenMap (https://github.com/cpockrandt/genmap) with 150-mers to mimic the read length. The mappability is output in a bedgraph format that can be visualize using IGV. We called peaks based on fragment size using MACS2 callpeak (v2017-10-26) on both multi and uniquely mapping reads. The output files are tab delimited files.

G2/Jockey-3 evolutionary analyses

We identified G2/Jockey-3 sequences with two complementary methods. First, we annotated each genome assembly with our custom Drosophila TE library including the D. melanogaster G2/Jockey-3 consensus sequence using Repeatmasker v4.1.0. The annotations and 500 bp flanking regions were extracted with BEDTools v2.29.0 and aligned with MAFFT to generate a species-specific consensus sequence with Geneious v.8.1.6. Each assembly was annotated again using Repeatmasker with the appropriate species-specific G2/Jockey-3 consensus sequence. Second, we constructed de novo repeat libraries for each species with RepeatModeler2 v.2.0.1 and identified candidate G2/Jockey-3 sequences which shared high similarity with G2/Jockey-3 in D. melanogaster identified with BLAST v.2.10.0. We did the same with Jockey-1 (LINEJ1_DM) as confirmation of our methods, and to use it as an outgroup for the TE fragment alignment. We removed candidates shorter than 100 bp from the analysis. We identified ORFs within consensus TE sequences with NCBI ORFfinder. We used Repeatmasker to annotate the genome assemblies with the de novo Jockey-3 consensus sequences. To infer a phylogenetic tree of TEs, we aligned G2/Jockey-3 fragments identified in each species with MAFFT and retained sequences corresponding to the ORF bounds of the consensus sequences; We removed ORF fragments <400 bp. We inferred the tree with RAxML v.8.2.11 using the command “raxmlHPC-PTHREADS -s alignment_Jockey-3_melsimyak_400_ORF2_mafft.fasta -m GTRGAMMA -T 24 -d -p 12345 -# autoMRE -k -x 12345 -f a”.

Files and variables

File: Coverage.zip

Description: This directory contains Bigwig files that are indexed binary format. They can be opened in IGV software to visualize the coverage track along the genome. We used them as input in our R script to make plots, using the package karyoploteR and the command kpPlotBigWig.

These coverage files were used to make Figure 1ADG-2-3-4AB-5-S1-S2-S2-S5. The R scripts used to make the figures can be found here : https://github.com/LarracuenteLab/SimClade_Centromere_2024.github and in the directory SimClade_Centromere_2024-main in this repository.

The first part of the file name corresponds to the species : Maur (D. mauritiana), N25 (D. melanogaster), Sech (D. sechellia), XD1 (D. simulans). The second part corresponds to the Antibody used ("CID", "H2K9me3", "IgG"), they are 3 replicate for the CID (or CENP-A) antibody (CID1, CID2, CID3). The last part of the file name indicates whether we kept multi-mapping reads ("q30_RPM.bw") or only report uniquely mapping reads ("q0_RPM.bw"). For example "Maur_CID1_bwa_q0_RPM.bw" corresponds to normalized uniquely-mapping read coverage of the first replicate of CENP-A CUT&Tag data for D. mauritiana.

This folder also contains the mappability track for D. simulans, D. sechellia and D. mauritiana. The mappability score was calculated using GenMap (https://github.com/cpockrandt/genmap) with 150-mers to mimic the read length. The mappability is output in a bedgraph format that can be visualized using IGV.

File: Macs2_Peaks.zip

Description: We called peaks based on fragment size using MACS2 callpeak (v2017-10-26) on both multi and uniquely mapping reads. The output files are tab delimited files in this directory.
The files are composed of 10 columns :

Name of the chromosome
The starting position of the feature in the chromosome or scaffold
The ending position of the feature in the chromosome or scaffold
Peak name
Score base on signal value (0-1000)
strand - +/-
signalValue - Measurement of overall (usually, average) enrichment for the region.
pValue
qValue
peak - Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.

The first part of the file name corresponds to the species : Maur (D. mauritiana), N25 (D. melanogaster), Sech (D. sechellia), XD1 (D. simulans). The second part corresponds to the Antibody used ("CID", "H2K9me3", "IgG"), they are 3 replicate for the CID (or CENP-A) antibody (CID1, CID2, CID3). The last part of the third name indicates whether we kept multi-mapping reads ("bwa_q30") or only report uniquely mapping reads ("bwa_q0"). For example "Maur_CID1_bwa_q0_macs2_peaks.narrowPeak" corresponds to peak localization of the multi-mapping read of the first replicate of CENP-A CUT&Tag data for D. mauritiana. These narrowPeak files can be visualized with a took like Integrative Genomics Viewer (IGV; https://igv.org/) or by plotting in R. In Courret et al. 2024, we used the R scripts in Code/CoveragePlot directory to plot these peaks.

File: Phylogeny.zip

Description: This directory contains the newick format files to generate the phylogenetic tree for the G2/Jockey-3 fragments identified in each species. We made these trees with MAFFT and retained sequences corresponding to the ORF bounds of the consensus sequences; We removed ORF fragments <400 bp. We generated the following two treefiles with RAxML v.8.2.11- one for all individual fragments <400bp in each species and one for the consensus TEs for each species:

-RAxML_bipartitions.alignment_jockey_family_Final.automre - all fragments in each species
-RAxML_bipartitions.alignment_Jockey-3_melsimyak_400_ORF2_mafft_Jockey-1_Dse.automre - consensus TEs for each species

These automre files can be used to draw phylogenetic trees with ggtree package, a phylogenetic illustration package extension for ggplot. Additional packages including ape were used to import the NEWICK phylogenetic tree files and other tidyverse packages. The specific R scripts used in the paper are under Code/G2-Jockey-3_evolution and can be run from the command line, changing the path of the AUTOMRE files as needed within the R files. For all fragments use Jockey-3_melsimyak_ORF2_PhylogeneticTree.R. For consensus TEs, use Jockey_family_PhylogeneticTree.R.

Code

This repository contains the R code and the intermediate files necessary to make the Figures, and the bash script used to analyze the raw data. R scripts are organized by Figure type. Unpacking this file will reveal subdirectories containing code for four types of plots and the data analysis pipeline as follows:

CoveragePlot

Description: This directory contains all the R scripts necessary to make Figures 1ADG-2-3-4AB-5-S1-S2-S2-S5. The R script require the library karyoploteR, regioneR, GenomicRanges, rtracklayer, IRanges, devtools, stringr. This folder contain a subfolder named "Reference" that contain the genome object (chromosome.size and "cytoband") necessary to make those figures.

G2-Jockey-3_evolution

Description: This directory contains R scripts necessary to create phylogenetic trees. It requires the following libraries: ape, phytools, geiger, evobiR, ggplot2, dplyr, tidyr and ggtree. It also requires the tree files RAxML_bipartitions.alignment_jockey_family_Final.automre and RAxML_bipartitions.alignment_Jockey-3_melsimyak_400_ORF2_mafft_Jockey-1_Dse.automre, located in the Phylogeny directory. For all fragments use Jockey-3_melsimyak_ORF2_PhylogeneticTree.R with RAxML_bipartitions.alignment_jockey_family_Final.automre as input. For consensus TEs, use Jockey_family_PhylogeneticTree.R with RAxML_bipartitions.alignment_Jockey-3_melsimyak_400_ORF2_mafft_Jockey-1_Dse.automre as input.

IDR

Description: This directory contains the R scripts necessary to make Figure S4. This directory also contains the file necessary to make this figure: the chrome.size object and the q0.Top-idr (tab delimited file containing the top peaks conserved between replicates).

TE_enrichment_Plot:

Description: This directory contains the R script necessary to make Figure 1BEH. This directory also contains the file necessary to make this figure: the "CID_RPM_Top_newname.out" is tab delimited file that contain the top most enriched TE.

Pipeline

Description : This directory contains the bash script used to analyzed the CUT&Tag data.

script: CutandTag.sh - performs the analysis of the CUT&Tag raw data to obtain the coverage plots. (for Figure 1ADG-2-3-4AB-5-S1-S2-S2-S5)

script: CutandTag_TEcount.sh - estimates the TE enrichment (for Figure 1BEH)

script: CutandTag_peakCalling.sh - does the peak calling (for Figure 1ADG-2-3-4AB-5-S1-S2-S2-S5).

script: Blast_Sat.sh - identifies the satellite locations in the genome using Blast (for Figure S7).

Access information

Other publicly accessible locations of the data:

All the BASH pipelines and R scripts used in this study are also available on github: https://github.com/LarracuenteLab/SimClade_Centromere_2024.

Data was derived from the following sources:

All sequences are available from NCBI SRA under Bioproject accession PRJNA1007690.

Overview:

This repository contains data and code used in Courret et al. 2024 (https://doi.org/10.1371/journal.pbio.3002911).

CUT&Tag methods:

G2/Jockey-3 evolutionary analyses

We identified G2/Jockey-3 sequences with two complementary methods. First, we annotated each genome assembly with our custom Drosophila TE library including the D. melanogaster G2/Jockey-3 consensus sequence using Repeatmasker v4.1.0. The annotations and 500 bp flanking regions were extracted with BEDTools v2.29.0 and aligned with MAFFT to generate a species-specific consensus sequence with Geneious v.8.1.6. Each assembly was annotated again using Repeatmasker with the appropriate species-specific G2/Jockey-3 consensus sequence. Second, we constructed de novo repeat libraries for each species with RepeatModeler2 v.2.0.1 and identified candidate G2/Jockey-3 sequences which shared high similarity with G2/Jockey-3 in D. melanogaster identified with BLAST v.2.10.0. We did the same with Jockey-1 (LINEJ1_DM) as confirmation of our methods, and to use it as an outgroup for the TE fragment alignment. We removed candidates shorter than 100 bp from the analysis. We identified ORFs within consensus TE sequences with NCBI ORFfinder. We used Repeatmasker to annotate the genome assemblies with the de novo Jockey-3 consensus sequences. To infer a phylogenetic tree of TEs, we aligned G2/Jockey-3 fragments identified in each species with MAFFT and retained sequences corresponding to the ORF bounds of the consensus sequences; We removed ORF fragments <400 bp. We inferred the tree with RAxML v.8.2.11 using the command “raxmlHPC-PTHREADS -s alignment_Jockey-3_melsimyak_400_ORF2_mafft.fasta -m GTRGAMMA -T 24 -d -p 12345 -# autoMRE -k -x 12345 -f a”.

Turnover of retroelements and satellite DNA drives centromere reorganization over short evolutionary timescales in Drosophila

Data files

Abstract

README: Rapid turnover of centromeric DNA reveals signatures of genetic conflict in Drosophila

Description of the data and file structure

Files and variables

File: Coverage.zip

File: Macs2_Peaks.zip

File: Phylogeny.zip

Code

Access information

Methods