Data from: Functional monocentricity with holocentric characteristics and chromosome-specific centromeres in a stick insect

Toubiana, William1; Dumas, Zoé1; Tran Van, Patrick2; Parker, Darren3; Mérel, Vincent1; Schubert, Veit4; Aury, Jean-Marc5; Bournonville, Lorène6; Cruaud, Corinne5; Houben, Andreas4; Istace, Benjamin5; Labadie, Karine5; Noel, Benjamin5; Schwander, Tanja1

Published Dec 03, 2024 on Dryad. https://doi.org/10.5061/dryad.hx3ffbgph

Data files

Dec 03, 2024 version files: 2.55 MB

Git_Tdi_centromere_paper-main.zip (2.54 MB)

README.md (2.93 KB)

Abstract

Centromeres are essential for chromosome segregation in eukaryotes, yet their specification is surprisingly diverse among species, and can involve major transitions such as those from localized to chromosome-wide centromeres between monocentric and holocentric species. How this diversity evolves remains elusive. We discovered within-cell variation in the recruitment of the major centromere protein CenH3, reminiscent of variation typically observed among species. While CenH3-containing nucleosomes are distributed in a monocentric fashion on autosomes and bind tandem repeat sequences specific to individual or groups of chromosomes, they show a longitudinal distribution and broad intergenic binding on the X chromosome, which partially recapitulates phenotypes known from holocentric species. Despite this variable CenH3 distribution among chromosomes, all chromosomes are functionally monocentric, marking the first instance of a monocentric species with chromosome-wide CenH3 deposition. Together, our findings illustrate a potential transitional state between mono- and holocentricity or towards CenH3-independent centromere determination, and help to understand the rapid centromere sequence divergence between species.

This repository contains the code used to analyse centromere sequences in the article "Functional monocentricity with holocentric characteristics and chromosome-specific centromeres in a stick insect".

The input data files needed to run the code are available upon request.

Documentation

script_chip_tdi_paper_final.sh describes the general pipeline to analyse the centromere sequences in T. douglasi.

The separated folders include the different scripts used in the overall pipeline:

TE_annotation

script_transposable_element_annotation.Rmd: annotates transposable elements in the T. douglasi genome assembly.

TR annotation and minimal rotations

script_tandem-repeat_annotation.sh: annotates tandem repeat sequences in the T. douglasi genome assembly.

script_minimal_rotation_parse.pl: orders every repeated motif sequence alphabetically (minimal rotation), so that rotated copies of the same motif are written identically.
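The idea of a minimal rotation can be illustrated with a short sketch. This is not the repository's Perl script; it assumes that "minimal rotation" means the alphabetically smallest of all circular rotations of a motif, and the function name `minimal_rotation` is hypothetical.

```python
def minimal_rotation(motif: str) -> str:
    """Return the alphabetically smallest circular rotation of a motif.

    Assumption: motifs are plain strings; case is normalised to upper case.
    """
    motif = motif.upper()
    n = len(motif)
    if n == 0:
        return motif
    # Every rotation of length n is a substring of the motif written twice.
    doubled = motif + motif
    return min(doubled[i:i + n] for i in range(n))


# Three rotations of the same (made-up) motif collapse to one representative.
for m in ["GATTACA", "TTACAGA", "ACAGATT"]:
    print(m, "->", minimal_rotation(m))
```

Reducing every motif to a single representative this way means that a tandem repeat reported from different starting positions is counted as one motif.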

kmer_approach

config_xla_merge_final_SE_v9_yf_genome10.2.yaml: configuration file for the k-mer approach.

snakefile_v9_genomev10PRE_SE.py: Python script to identify CenH3-enriched k-mer motifs.

Levenshtein distances

script_levenshtein_rotations_F-F_inputFiles.py: computes pairwise Levenshtein distances between motif sequences.

script_levenshtein_rotations_F-R_inputFiles.py: computes pairwise Levenshtein distances between motif sequences and their reverse complements.
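The two comparisons above (forward against forward, and forward against reverse complement) can be sketched as follows. This is a minimal illustration and not the repository's scripts: the function names are hypothetical, and taking the minimum distance over all rotations of the second motif is an assumption about how rotations are handled.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def min_rotation_distance(a: str, b: str) -> int:
    """Smallest edit distance between a and any circular rotation of b (assumed scheme)."""
    return min(levenshtein(a, b[i:] + b[:i]) for i in range(max(len(b), 1)))


# Forward-forward and forward-reverse comparisons on made-up motifs.
print(min_rotation_distance("ACGT", "GTAC"))
print(min_rotation_distance("AACCT", reverse_complement("AGGTT")))
```

Comparing against the reverse complement as well matters because the same tandem repeat can be annotated on either strand.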

Rscripts

TRF_parsing.R: parses the GFF3 file obtained from tandem repeat annotation (see script_tandem-repeat_annotation.sh).

Proportion_categories.R: estimates the proportion and enrichment of sequence categories annotated in the T. douglasi genome assembly.

Enriched_windows.R: selects 10 kb windows based on the coverage ratio of CenH3-ChIP to input.

Enriched_minimal_rotation_TR_motifs.R: extracts and duplicates motif sequences with minimal rotations to compute Levenshtein distances.

Levenstein_network.R: builds a network of sequence similarities among tandem repeat motifs identified in the genome assembly.

kmer_minimal_rotation_TR_motifs.R: extracts and duplicates motif sequences identified in the de novo contigs with minimal rotations to compute Levenshtein distances.

Levenstein_network_contigs.R: builds a network of sequence similarities among tandem repeat motifs identified in the de novo contigs (k-mer approach).

Heatmap_TRF-based.R: creates a heatmap with hierarchical clustering based on total array length, inferred by summing Tandem Repeats Finder array lengths per repeat family.

Heatmap_blast-based.R: creates a heatmap with hierarchical clustering based on total array length, inferred by summing the lengths of sequence motif BLAST hits with 80% sequence similarity and 80% query coverage.
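The BLAST-based length estimate can be illustrated with a small sketch. It is not the R script itself. It assumes that hits are already parsed into tuples, that the 80% thresholds are inclusive, that query coverage is the aligned query span divided by the motif length, and that lengths are summed per motif and subject sequence. The function name, tuple layout and example values are all hypothetical.

```python
from collections import defaultdict


def total_hit_length(hits, query_lengths, min_identity=80.0, min_coverage=80.0):
    """Sum subject-side hit lengths per (motif, subject) for hits passing both filters.

    hits: iterable of (qseqid, sseqid, pident, qstart, qend, sstart, send)
    query_lengths: dict mapping qseqid to motif length in bp
    """
    totals = defaultdict(int)
    for qseqid, sseqid, pident, qstart, qend, sstart, send in hits:
        # Percentage of the query motif covered by this alignment.
        coverage = 100.0 * (qend - qstart + 1) / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            # abs() handles hits on the reverse strand, where sstart > send.
            totals[(qseqid, sseqid)] += abs(send - sstart) + 1
    return dict(totals)


# Made-up example data, for illustration only.
query_lengths = {"motifA": 100, "motifB": 45}
hits = [
    ("motifA", "chr1", 95.0, 1, 100, 1000, 1099),  # kept
    ("motifA", "chr1", 85.0, 5, 100, 5000, 5095),  # kept
    ("motifA", "chr1", 70.0, 1, 100, 9000, 9099),  # identity too low
    ("motifA", "chrX", 90.0, 1, 50, 200, 249),     # coverage too low
    ("motifB", "chrX", 82.0, 1, 40, 449, 410),     # kept, reverse strand
]
print(total_hit_length(hits, query_lengths))
```

The resulting totals form the kind of motif-by-sequence table that a heatmap with hierarchical clustering could then be drawn from.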

Genome_assembly

Genome assembly pipeline for T. douglasi.

Sex_chr_ID

Scripts to identify the X chromosome in T. douglasi.

Gene_annotation

Gene annotation scripts for the T. douglasi genome.