Data from: Functional monocentricity with holocentric characteristics and chromosome-specific centromeres in a stick insect
Data files
Dec 03, 2024 version files (2.55 MB total)
Git_Tdi_centromere_paper-main.zip (2.54 MB)
README.md (2.93 KB)
Abstract
Centromeres are essential for chromosome segregation in eukaryotes, yet their specification is surprisingly diverse among species, and can involve major transitions such as those from localized to chromosome-wide centromeres between monocentric and holocentric species. How this diversity evolves remains elusive. We discovered within-cell variation in the recruitment of the major centromere protein CenH3, reminiscent of variation typically observed among species. While CenH3-containing nucleosomes are distributed in a monocentric fashion on autosomes and bind tandem repeat sequences specific to individual or groups of chromosomes, they show a longitudinal distribution and broad intergenic binding on the X chromosome, which partially recapitulates phenotypes known from holocentric species. Despite this variable CenH3 distribution among chromosomes, all chromosomes are functionally monocentric, marking the first instance of a monocentric species with chromosome-wide CenH3 deposition. Together, our findings illustrate a potential transitional state between mono- and holocentricity or towards CenH3-independent centromere determination, and help to understand the rapid centromere sequence divergence between species.
README: Centromere sequence identification in Timema douglasi
This repository contains the code used to analyse centromere sequences in the article "Functional monocentricity with holocentric characteristics and chromosome-specific centromeres in a stick insect".
Input data files needed to run the code are available upon request.
Documentation
script_chip_tdi_paper_final.sh describes the general pipeline to analyse the centromere sequences in T. douglasi.
The separate folders contain the individual scripts used in the overall pipeline:
TE_annotation
script_transposable_element_annotation.Rmd: annotates transposable elements in the T. douglasi genome assembly.
TR annotation and minimal rotations
script_tandem-repeat_annotation.sh: annotates tandem repeat sequences in the T. douglasi genome assembly.
script_minimal_rotation_parse.pl: rewrites every repeat motif sequence as its minimal rotation, i.e. the alphabetically smallest of its cyclic rotations.
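The minimal-rotation step can be illustrated with a short Python sketch. It is not the Perl script from the repository; the function name `minimal_rotation` is hypothetical, and the sketch assumes that "minimal rotation" means the lexicographically smallest cyclic rotation of a motif, so that rotated copies of the same repeat unit collapse to a single representative.

```python
def minimal_rotation(motif: str) -> str:
    """Return the lexicographically smallest cyclic rotation of a motif.

    Hypothetical helper for illustration only; the repository's Perl
    script may differ in its details.
    """
    motif = motif.upper()
    if not motif:
        return motif
    # Enumerate every cyclic rotation and keep the smallest one.
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


# Two rotated copies of the same repeat unit map to one representative.
print(minimal_rotation("TTAGG"))
print(minimal_rotation("GGTTA"))
```

Enumerating all rotations costs quadratic time in the motif length, which is acceptable for short tandem repeat units; a linear-time algorithm such as Booth's would be an alternative for very long motifs.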
kmer_approach
config_xla_merge_final_SE_v9_yf_genome10.2.yaml: configuration file for the k-mer approach.
snakefile_v9_genomev10PRE_SE.py: Python script to identify CenH3-enriched k-mer motifs.
Levenshtein distances
script_levenshtein_rotations_F-F_inputFiles.py: computes pairwise Levenshtein distances between motif sequences.
script_levenshtein_rotations_F-R_inputFiles.py: computes pairwise Levenshtein distances between motif sequences and their reverse complements.
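As a rough illustration of the two comparisons listed above (forward against forward, and forward against reverse complement), the following Python sketch computes both distances for a few made-up motifs. The function names and the example motifs are assumptions for illustration; the repository scripts may handle rotations and input files differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Hypothetical motifs, not taken from the T. douglasi data.
motifs = ["AACCT", "AACGT", "AGGTT"]
for i, first in enumerate(motifs):
    for second in motifs[i + 1:]:
        forward = levenshtein(first, second)
        reverse = levenshtein(first, reverse_complement(second))
        print(first, second, "F-F:", forward, "F-R:", reverse)
```

Comparing against the reverse complement matters because the same tandem repeat can be annotated on either strand: here "AGGTT" is the reverse complement of "AACCT", so the F-R distance between the two is zero even though the F-F distance is not.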
Rscripts
TRF_parsing.R: parses the GFF3 file obtained from the tandem repeat annotation (see script_tandem-repeat_annotation.sh).
Proportion_categories.R: estimates proportion and enrichment of sequence categories annotated in the T. douglasi genome assembly.
Enriched_windows.R: selects 10 kb windows based on the coverage ratio of CenH3 ChIP to input.
Enriched_minimal_rotation_TR_motifs.R: extracts and duplicates motif sequences with minimal rotations to compute Levenshtein distances.
Levenstein_network.R: builds network of sequence similarities among tandem repeat motifs identified in the genome assembly.
kmer_minimal_rotation_TR_motifs.R: extracts and duplicates motif sequences identified in the de novo contigs with minimal rotations to compute Levenshtein distances.
Levenstein_network_contigs.R: builds network of sequence similarities among tandem repeat motifs identified in the de novo contigs (k-mer approach).
Heatmap_TRF-based.R: creates a heatmap with hierarchical clustering based on total array length, inferred by summing Tandem Repeats Finder array lengths per repeat family.
Heatmap_blast-based.R: creates a heatmap with hierarchical clustering based on total array length, inferred by summing the lengths of sequence motif BLAST hits with 80% sequence similarity and 80% query coverage.
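The window selection described for Enriched_windows.R can be sketched in Python as follows. This is only an illustration of filtering windows by a ChIP-to-input coverage ratio: the `Window` record, the `min_ratio` threshold of 2.0 and the coverage values are all hypothetical, and the R script may normalise coverage or choose its cutoff differently.

```python
from typing import List, NamedTuple, Tuple


class Window(NamedTuple):
    """One fixed-size genomic window with mean coverage in both libraries."""
    scaffold: str
    start: int
    chip_cov: float   # CenH3 ChIP coverage
    input_cov: float  # input (control) coverage


def select_enriched(windows: List[Window],
                    min_ratio: float = 2.0) -> List[Tuple[Window, float]]:
    """Keep windows whose ChIP/input coverage ratio reaches min_ratio.

    min_ratio is an assumed, illustrative threshold. Windows without
    input coverage are skipped because the ratio is undefined there.
    """
    selected = []
    for window in windows:
        if window.input_cov <= 0:
            continue
        ratio = window.chip_cov / window.input_cov
        if ratio >= min_ratio:
            selected.append((window, ratio))
    return selected


# Made-up coverage values for three consecutive 10 kb windows.
example = [
    Window("scaffold_1", 0, 40.0, 10.0),
    Window("scaffold_1", 10_000, 12.0, 10.0),
    Window("scaffold_1", 20_000, 5.0, 0.0),
]
for window, ratio in select_enriched(example):
    print(window.scaffold, window.start, ratio)
```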
Genome_assembly
Genome assembly pipeline for T. douglasi.
Sex_chr_ID
Scripts to identify the X chromosome in T. douglasi.
Gene_annotation
Gene annotation scripts for the T. douglasi genome.