Heterochromatin-dependent transcription of satellite DNAs in the Drosophila melanogaster female germline
Wei, Xiaolu; Eickbush, Danna G.; Speece, Iain; Larracuente, Amanda M. (2021), Heterochromatin-dependent transcription of satellite DNAs in the Drosophila melanogaster female germline, Dryad, Dataset, https://doi.org/10.5061/dryad.hdr7sqvj3
Large blocks of tandemly repeated DNAs—satellite DNAs (satDNAs)—play important roles in heterochromatin formation and chromosome segregation. We know little about how satDNAs are regulated, however their misregulation is associated with genomic instability and human diseases. We use the Drosophila melanogaster germline as a model to study the regulation of satDNA transcription and chromatin. Here we show that complex satDNAs (>100-bp repeat units) are transcribed into long noncoding RNAs and processed into piRNAs (PIWI interacting RNAs). This satDNA piRNA production depends on the Rhino-Deadlock-Cutoff complex and the transcription factor Moonshiner—a previously-described non-canonical pathway that licenses heterochromatin-dependent transcription of dual-strand piRNA clusters. We show that this pathway is important for establishing heterochromatin at satDNAs. Therefore, satDNAs are regulated by piRNAs originating from their own genomic loci. This novel mechanism of satDNA regulation provides insight into the role of piRNA pathways in heterochromatin formation and genome stability.
Data analysis methods: Information on the data analysis is below and a full description of materials in methods is in the accompanying paper by Wei et al.
Satellite DNA analysis: Reads were mapped to the heterochromatin-enriched genome assembly (Chang and Larracuente 2019) and counted based on their annotations (e.g. Rsp or 1.688). Due to the highly repetitive nature of satDNAs, around 80% of total RNA-seq and 99% of small RNA-seq reads that are mapped to satDNA regions are not uniquely assigned; discarding these multiple mapped reads would result in loss of statistical power in the satDNA analysis. To deal with this, multiple mapped reads were randomly assigned to one of their multiple best mapping locations, unless stated otherwise. Reads were then counted based on the annotations of their assigned mapping locations. Because there is high sequence similarity among the 1.688 subfamily repeats (260-bp, 359-bp, 353-bp, 356-bp), all 1.688 subfamilies were combined, unless stated otherwise. A similar approach was used in our analysis of piRNA clusters, except that only uniquely mapped reads were counted so that the published results could serve as controls for our method. Additional details specific to small RNA-seq, RNA-seq, ChIP-seq, and RIP-seq analyses are given below.
Total RNA-seq analysis: Total RNA-seq reads were trimmed for adaptors and then mapped to the genome using Bowtie2 (RRID:SCR_016368) (Langmead and Salzberg 2012). A customized python script was used to count reads that mapped to each repeat feature or piRNA cluster, and RPM values were reported by normalizing raw counts to 1,000,000 total mapped reads (htseq_bam_count_proportional.py). For the 1.688 subfamilies, all subfamilies were combined into one 1.688 category, although analyzing each by subfamily (e.g. 353-bp, 356-bp, 359-bp, 260-bp) does not change our conclusions. For results shown in Supplementary File 8, DESeq2 (RRID:SCR_015687) (Love, et al. 2014) was used to perform differential expression analysis of the raw counts with combined data from different studies (Mohn, et al. 2014; Andersen, et al. 2017), with experimental condition and associated study as covariates. This analysis method is conservative and leads to smaller log2 fold changes than published results of piRNA clusters. For comparison with the published results, a similar approach was used to analyze piRNA clusters (Mohn, et al. 2014; Andersen, et al. 2017). Briefly, quantification of reads mapping to 1-kb windows inside each piRNA cluster was estimated using a customized python script (https://github.com/LarracuenteLab/Dmelanogaster_satDNA_regulation (htseq_bam_count_proportional.py), and subsequent differential expression analysis between mutants and wildtype was done using DESeq2 (RRID:SCR_015687) (Love, et al. 2014; results shown in Wei et al. Figure 3–figure supplement 2).
Small RNA-seq analysis:Small RNA-seq reads were trimmed for adaptors, then mapped to the genome using Bowtie (RRID:SCR_005476) (Langmead 2010). A customized python script (htseq_bam_count_proportional.py) was used to count reads that mapped to each repeat feature or piRNA cluster. To control for differences in small RNA abundance and compare across samples, raw counts were then normalized to the number of reads that mapped to either miRNAs or the flamenco piRNA cluster. The difference in expression was represented by the log2 fold changes of these normalized counts in mutants compared to wild type (i.e., log2(countmutant/countWT)) for each repeat and piRNA cluster. We further analyzed the size distribution and relative nucleotide bias at positions along each satDNA by extracting reads mapped to the satDNA of interest using a customized python script (extract_sequence_by_feature_gff.py). The 10nt overlap Z-score of piRNAs mapped to each satDNA was calculated using piPipes (Han, et al. 2015). To determine which parts of these repeats are represented in piRNA or ChIP datasets, the read pileup patterns along the consensus sequence of a satDNA were examined (e.g. Wei et al. Figure 2–figure supplement 2). Reads (ChIP or piRNA) mapping to a particular satDNA or genomic satDNA variant (as a control) were BLAST-ed to the consensus dimer (for 1.688 satellite) or trimer (for Rsp because it has left and right consensus sequences), and then coordinates were converted along a dimer/trimer to coordinates along a monomer/dimer consensus sequence.
ChIP/RIP-seq analysis: ChIP-seq and RIP-seq reads were trimmed for adaptors and mapped to the genome using Bowtie2 (RRID:SCR_016368) (Langmead and Salzberg 2012). The script htseq_bam_count_proportional.py was used to count reads that mapped to each repeat feature or piRNA cluster. Raw counts were normalized to 1,000,000 total mapped reads.
For the ChIP-seq results, enrichment scores of each repeat and piRNA cluster were reported by comparing the ChIP sample with the antibody of interest to its no-antibody input control sample. For ChIP-seq analyses, consider satDNA as discrete loci rather than repeat unit types is appropriate because some loci are composed of several repeat types. To examine the large blocks of heterochromatic satDNA chromatin for the Rhi and H3K9me3 ChIP-seq analyses, euchromatic 1.688 satDNAs were excluded and only reads that map uniquely to satDNA loci were analyzed. These analyses were repeated by combining all 1.688 subfamilies into a single category, and each subfamily was analyzed separately (e.g. all 353-bp repeats combined) but the conclusions do not change. Euchromatic controls are included for the Rhi and H3K9me3 ChIP-seq analyses. Here, the euchromatic control correspond to the median enrichment score for protein coding genes that are 5Mb distal from heterochromatin boundaries (Riddle, et al. 2011) and piRNA clusters.
Replicate numbers for all analyses were decided based on data availability. All ChIP-seq/RNA-seq/small RNA-seq analysis were done with 2-6 replicates with data from multiple studies, except some (in Wei et al. Figure3-figure supplement3) have 1 replicate due to data availability. All qRT-PCR experiments had 3-5 biological replicates, and qPCR experiments had 2-4 biological replicates.
Please see supplementary file 1 from Wei et al. for a complete list, including accession numbers, of the datasets used in this study.
All data files and code to recreate analyses and figures are in this repository and are also deposited in GitHub (https://github.com/LarracuenteLab/Dmelanogaster_satDNA_regulation).
Preprint associated with this dataset: Wei,X, Eickbush,D.G., Speece,I., A.M. Larracuente. 2020. Heterochromatin-dependent transcription of satellite DNAs in the Drosophila melanogaster female germline. BIORXIV/2020/268920 8/28/20. Doi: 10.1101/2020.08.26.268920
The following repository contains code for data analysis and figures in Wei et al. In this paper, we show that complex satDNAs (>100-bp repeat units) are transcribed into long noncoding RNAs and processed into piRNAs (PIWI interacting RNAs). This satDNA piRNA production depends on the Rhino-Deadlock-Cutoff complex and the transcription factor Moonshiner—a previously-described non-canonical pathway that licenses heterochromatin-dependent transcription of dual-strand piRNA clusters. We show that this pathway is important for establishing heterochromatin at satDNAs. Therefore, satDNAs are regulated by piRNAs originating from their own genomic loci. This novel mechanism of satDNA regulation provides insight into the role of piRNA pathways in heterochromatin formation and genome stability.
Please direct questions to Xiaolu Wei at firstname.lastname@example.org or Amanda Larracuente at email@example.com, or code author in each script. We organize the data files and scripts in this repository according to the figures or supplementary files in Wei et al. We also include scripts and general data files that are not associated with specific figures, including:
#count.sh: bash script used to count and extract reads that mapped to repeats from alignment files. Example usage: count.sh
#extract_sequence_by_feature_gff.py: python script used to extract reads that map to a repeat feature of interest and output fastq format. Example usage: python3 extract_sequence_by_feature_gff.py -t "RepeatName" alignment.bam annotation.gff output.fq
#htseq_bam_count_proportional.py: python script used to count reads mapped to repeat features from alignment files. Example usage: python3 htseq_bam_count_proportional.py alignment.bam annotation.gff count.out
#dmel_scaffold2_plus0310_rm_for_htseq.gff3: repeat annotation file for our heterochromatin enriched assembly from Chang and Larracuente 2019 (https://doi.org/10.1534/genetics.118.301765).
#different.1pt688.anaylsis: We performed analysis of the 1.688 satDNA repeats for ChIP-seq, RIP-seq and RNA-seq data in different ways to make sure that different analysis methods do not affect our conclusions. The results for the different analysis methods are organized by data type in this folder. Files with suffix "_subfamily" means that each 1.688 subfamily analyzed separately. Files with suffix "_combined" means that all 1.688 subfamilies combining into a single category named 1.688. Files with suffix "_each_locus" means that each 1.688 heterochromatic locus analyzed separately with uniquely mapped reads or all mapped reads.
Usage: Please see scripts for further usage instructions.
National Institutes of Health, Award: R35 GM119515