Data from: Functional monocentricity with holocentric characteristics and chromosome-specific centromeres in a stick insect
Data files
Dec 03, 2024 version files (2.55 MB):
- Git_Tdi_centromere_paper-main.zip (2.54 MB)
- README.md (2.93 KB)
Abstract
Centromeres are essential for chromosome segregation in eukaryotes, yet their specification is surprisingly diverse among species, and can involve major transitions such as those from localized to chromosome-wide centromeres between monocentric and holocentric species. How this diversity evolves remains elusive. We discovered within-cell variation in the recruitment of the major centromere protein CenH3, reminiscent of variation typically observed among species. While CenH3-containing nucleosomes are distributed in a monocentric fashion on autosomes and bind tandem repeat sequences specific to individual or groups of chromosomes, they show a longitudinal distribution and broad intergenic binding on the X chromosome, which partially recapitulates phenotypes known from holocentric species. Despite this variable CenH3 distribution among chromosomes, all chromosomes are functionally monocentric, marking the first instance of a monocentric species with chromosome-wide CenH3 deposition. Together, our findings illustrate a potential transitional state between mono- and holocentricity or towards CenH3-independent centromere determination, and help to understand the rapid centromere sequence divergence between species.
This repository contains the code used to analyse centromere sequences in the article "Functional monocentricity with holocentric characteristics and chromosome-specific centromeres in a stick insect".
Input data files needed to run the code are available upon request.
Documentation
script_chip_tdi_paper_final.sh describes the general pipeline to analyse the centromere sequences in T. douglasi.
The separate folders contain the scripts used in the overall pipeline:
TE_annotation
script_transposable_element_annotation.Rmd: annotates transposable elements in the T. douglasi genome assembly.
TR annotation and minimal rotations
script_tandem-repeat_annotation.sh: annotates tandem repeat sequences in the T. douglasi genome assembly.
script_minimal_rotation_parse.pl: orders every repeated motif sequence alphabetically.
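As an illustration of the minimal-rotation idea, here is a minimal Python sketch. It assumes that "ordering a motif alphabetically" means picking the lexicographically smallest cyclic rotation of the motif, so that the same tandem repeat unit reported with different start positions collapses to one canonical string. The function name `minimal_rotation` is hypothetical and is not taken from the Perl script.

```python
def minimal_rotation(motif: str) -> str:
    """Return the lexicographically smallest cyclic rotation of a motif.

    Assumption: this is what "minimal rotation" refers to; the repository's
    Perl script may handle case, ambiguity codes or ties differently.
    """
    motif = motif.upper()
    if not motif:
        return motif
    # Enumerate every cyclic rotation and keep the alphabetically first one.
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


if __name__ == "__main__":
    # Two reports of the same repeat unit with different start positions
    # map to the same canonical form.
    print(minimal_rotation("GATTA"))
    print(minimal_rotation("TTAGA"))
```

Enumerating all rotations is quadratic in motif length, which should be acceptable for short repeat units; a linear-time method such as Booth's algorithm could be swapped in for very long motifs.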
kmer_approach
config_xla_merge_final_SE_v9_yf_genome10.2.yaml: configuration file for the k-mer approach.
snakefile_v9_genomev10PRE_SE.py: Python script to identify CenH3-enriched k-mer motifs.
Levenshtein distances
script_levenshtein_rotations_F-F_inputFiles.py: computes pairwise Levenshtein distances between motif sequences.
script_levenshtein_rotations_F-R_inputFiles.py: computes pairwise Levenshtein distances between motif sequences and their reverse complements.
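The two comparisons described above (forward against forward, and forward against reverse complement) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and taking the minimum distance over all cyclic rotations of the second motif is an assumed way of making the comparison independent of the repeat start position, not a description of what the repository scripts do.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(seq: str) -> list[str]:
    """All cyclic rotations of a sequence (a single empty string if seq is empty)."""
    return [seq[i:] + seq[:i] for i in range(len(seq))] or [seq]


def rotation_aware_distance(a: str, b: str, reverse: bool = False) -> int:
    """Smallest edit distance between `a` and any rotation of `b`.

    With reverse=True, `b` is reverse-complemented first (the F-R comparison).
    Assumption: minimising over rotations is how rotation is accounted for.
    """
    target = reverse_complement(b) if reverse else b
    return min(levenshtein(a, rotated) for rotated in rotations(target))


if __name__ == "__main__":
    print(rotation_aware_distance("ACGT", "GTAC"))                # F-F
    print(rotation_aware_distance("AACG", "CGTT", reverse=True))  # F-R
```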
Rscripts
TRF_parsing.R: parses the GFF3 file obtained from tandem repeat annotation (see script_tandem-repeat_annotation.sh).
Proportion_categories.R: estimates the proportion and enrichment of sequence categories annotated in the T. douglasi genome assembly.
Enriched_windows.R: selects 10 kb windows based on the coverage ratio of CenH3-ChIP to input.
Enriched_minimal_rotation_TR_motifs.R: extracts and duplicates motif sequences with minimal rotations to compute Levenshtein distances.
Levenstein_network.R: builds a network of sequence similarities among tandem repeat motifs identified in the genome assembly.
kmer_minimal_rotation_TR_motifs.R: extracts and duplicates motif sequences identified in the de novo contigs with minimal rotations to compute Levenshtein distances.
Levenstein_network_contigs.R: builds a network of sequence similarities among tandem repeat motifs identified in the de novo contigs (k-mer approach).
Heatmap_TRF-based.R: creates a heatmap with hierarchical clustering based on total array length, inferred by summing Tandem Repeats Finder array lengths per repeat family.
Heatmap_blast-based.R: creates a heatmap with hierarchical clustering based on total array length, inferred by summing the lengths of sequence motif BLAST hits with at least 80% sequence similarity and 80% query coverage.
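The BLAST-based total array length described above can be sketched in Python as follows. The sketch makes several assumptions not stated in this README: hits are in the default BLAST tabular layout (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); "sequence similarity" is read as percent identity; query coverage is the aligned query span divided by the motif length; and hit lengths are summed per motif and subject sequence. The function name and the `query_lengths` argument are hypothetical.

```python
from collections import defaultdict


def total_hit_length(lines, query_lengths, min_identity=80.0, min_query_cov=80.0):
    """Sum subject-span lengths of BLAST hits passing identity/coverage filters.

    lines         : iterable of tab-separated hit lines (assumed outfmt 6 layout)
    query_lengths : dict mapping query (motif) id -> motif length in bp
    Returns a dict keyed by (query id, subject id) with summed hit lengths.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        sstart, send = int(fields[8]), int(fields[9])
        # Percentage of the motif covered by this alignment.
        query_cov = 100.0 * (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if pident >= min_identity and query_cov >= min_query_cov:
            # Subject coordinates are reversed for minus-strand hits.
            totals[(qseqid, sseqid)] += abs(send - sstart) + 1
    return dict(totals)


if __name__ == "__main__":
    example_hits = [
        "motifA\tchr1\t95.0\t10\t0\t0\t1\t10\t101\t110\t1e-5\t20",
        "motifA\tchr1\t85.0\t9\t1\t0\t1\t9\t210\t202\t1e-3\t15",
        "motifA\tchrX\t70.0\t10\t3\t0\t1\t10\t5\t14\t0.1\t10",
    ]
    print(total_hit_length(example_hits, {"motifA": 10}))
```

Overlapping hits are counted more than once in this sketch; merging overlapping subject intervals before summing would be needed if that matters for the analysis.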
Genome_assembly
Genome assembly pipeline for T. douglasi.
Sex_chr_ID
Scripts to identify the X chromosome in T. douglasi.
Gene_annotation
Gene annotation scripts for the T. douglasi genome.
