Data from: Topological mixing and irreversibility in animal chromosome evolution
Data files
Apr 22, 2026 version files (1.24 GB total)
- BCnSSimakov2022_current_rbh_202509.tar.gz (568.04 MB)
- checksums.md5 (146 B)
- extract_data.sh (3 KB)
- newick_and_timetree_20251118.tar.gz (642.58 MB)
- README.md (37.22 KB)
- SupplementaryTable_16.xlsx (30.59 MB)
Abstract
Animal chromosome homology can persist over hundreds of millions of years, in spite of fusions and translocations. The frequency and pace of these major genomic changes, as well as their impact, remain unclear. Using a multi-scale manifold representation of pan-animal genome homology, we compare 3,631 whole chromosomal genomes spanning all major animal clades. This ‘evolutionary genome topology’ reveals irreversible genomic configurations at both chromosomal and subchromosomal levels. We show that accumulation of such states over macro-evolutionary time scales results in distinct evolutionary trajectories with intermediate positions associated with major phylogenetic groups emergence. We show that key developmental gene loci are impacted by this process, associating regulatory innovations with major evolutionary transitions.
This repository contains data from multiple related studies investigating chromosome evolution across Metazoa using ancestral linkage group (ALG) analyses and phylogenetic methods. This dataset contains pairs of BCnS loci that are stable, unstable, or close in a selection of clades. This is Supplementary Table 16 in the manuscript https://doi.org/10.1101/2024.07.29.605683
- Original DOI: https://doi.org/10.5061/dryad.m905qfvbv
Brief Description
This dataset includes:
- Topological mixing and irreversibility analysis (SupplementaryTable_16.xlsx): Data on clade-defining locus pairs that demonstrate topological linkage and mixing in animal chromosome evolution
- Comprehensive phylogenetic analysis workflow (newick_and_timetree_20251118/): Complete analytical pipeline including phylogenetic trees, chromosome evolution analyses, and temporal pattern detection across 4,454 animal genomes
- BCnS Reciprocal Best Hits database (BCnSSimakov2022_current_rbh_202509/): Orthology data files (5,821 RBH files, 3.6 GB) generated by the ODP pipeline, containing reciprocal best hit information for all analyzed species. These files form the basis of downstream analyses of chromosome evolution in this manuscript.
Data Extraction Instructions
This dataset is distributed as compressed archives to comply with Dryad's file limits.
Quick Start
After downloading all files from Dryad, extract the data:
bash extract_data.sh
This will:
- Verify file integrity (checksums)
- Extract BCnSSimakov2022_current_rbh_202509/ (5,821 RBH files, 3.6 GB)
- Extract newick_and_timetree_20251118/ (11,101 workflow files, 4.7 GB)
- Verify successful extraction
Manual Extraction
If you prefer to extract manually:
# Extract RBH database
tar -xzf BCnSSimakov2022_current_rbh_202509.tar.gz
# Extract phylogenetic workflow
tar -xzf newick_and_timetree_20251118.tar.gz
Verify Checksums (Optional)
md5sum -c checksums.md5
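Where md5sum is not available (for example on some Windows setups), the same check can be approximated in Python. This is a minimal sketch, not part of the distributed scripts: it assumes checksums.md5 uses md5sum's default line layout ('<md5hex>  <filename>') and that the archives sit in the same directory as the checksum file. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Check every entry of an md5sum-style file; return {filename: bool}.

    Assumption: each non-empty line is '<md5hex>  <filename>' and the listed
    files are located next to the checksum file.
    """
    checksum_path = Path(checksum_file)
    results = {}
    for line in checksum_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        target = checksum_path.parent / name
        results[name] = target.is_file() and md5_of_file(target) == expected.lower()
    return results
```

Calling verify_checksums("checksums.md5") in the download directory returns a dictionary mapping each listed archive to True (digest matches) or False (missing file or mismatch).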
Expected Directory Structure After Extraction
.
├── BCnSSimakov2022_current_rbh_202509/ (5,821 files, 3.6 GB)
├── newick_and_timetree_20251118/ (11,101 files, 4.7 GB)
├── README.md (this file)
├── SupplementaryTable_16.xlsx
├── extract_data.sh
└── checksums.md5
Part 1: Topological Mixing and Irreversibility in Animal Chromosome Evolution
Description of the data and file structure
Definitions of clade-defining locus pairs
Pairs of loci can define the archetypal genome for a clade, and are topologically linked, if (a) the pair is unique to this clade, that is, the two sequences are present on the same chromosome only in genomes within this clade (type 1); (b) the pair is fixed at a stable distance in this clade, regardless of whether the loci are close or distant (type 2); or (c) the distance between the loci is very small in genomes in this clade (cis- interactions; type 3). It is therefore possible that type 3 pairs (small inter-locus distance, or cis-) are also type 2 pairs (stable inter-locus distance), but not vice versa. Type 3 pairs can generally be considered to reflect widely studied micro-syntenic regions, or constraints that act on very short (cis-) distances. Such regions can be easily identified via comparisons of both chromosome-scale and non-chromosome-scale genomes (14, 36, 90). Type 2 pairs, on the other hand, can only be uncovered by averaging dozens to hundreds of chromosome-scale genomes. Type 2 pairs, therefore, may reflect more distal entanglement (supplementary text). Extensive mixing, leading to entangled pairs of many gene enhancer-promoter regulatory links within a defined genomic region (e.g., a topological domain), can be seen to make the domain 'resistant' to random translocations that would break one or more of such regulatory links.
There are many locus pairs that are characteristically stable (type 2) or close (type 3) in any given clade. For example, in Spiralia, there are 2,398 pairs stable in the clade, and 2,471 especially close pairs. While the test for type 1 pairs is binary, the stringency of our search for type 2 and type 3 pairs (one-sided μ-2σ cutoffs, ≥ 50% of species with this pair, Methods) means that these numbers are conservative estimates of the number of identifiable markers. We note that the most strongly stable pairs (type 2) are also close pairs (type 3), and these novel linkages appear to be rarely lost in genomes within the clade once the loci become linked (fig. S10C and F). This property enables stable pairs to provide a phylogenetic signal, as their retention could comprise irreversible states due to topological or regulatory features (Figure 1). We find 690 stable pairs shared in deuterostomes and 185 pairs in protostomes, with no overlap between the two sets. Similarly, out of 2,398 spiralian stable pairs, 1,743 and 610 are retained in mollusks and annelids, respectively, whereas only 497 are shared between mollusks and annelids.
Files and variables
File: SupplementaryTable_16.xlsx
Description: The file contained in this dataset is Supplementary Table S16, the legend of which is reproduced below.
Variables
- Table S16. Individual pairs that define each clade. Key for columns:
- Column A (nodename): The name of the clade for which this pair of loci is being considered.
- Column B (taxid): The NCBI taxid for the clade listed in column A.
- Column C (ortholog1): One of the orthologs from the BCnS ALGs in the pair.
- Column D (rbh1): The ALG to which ortholog1 belongs.
- Column E (ortholog2): The other ortholog from the pair.
- Column F (rbh2): The ALG to which ortholog2 belongs.
- Column G (close_in_clade): 1 if the pair is especially close in this clade, defined by having an occupancy_in of >50% and a z-score of less than -2 for mean_in_out_ratio_log_sigma.
- Column H (distant_in_clade): Same as close_in_clade, but the z-score must be greater than 2 in mean_in_out_ratio_log_sigma.
- Column I (stable_in_clade): Same as close_in_clade, but the z-score must be less than -2 in the column sd_in_out_ratio_log_sigma.
- Column J (unstable_in_clade): The z-score of this pair must be greater than 2 in the column sd_in_out_ratio_log_sigma.
- Column K (unique_to_clade): 1 if this pair only appears in this clade in the dataset, and not outside of the clade.
- Column L (pair): An index that uniquely describes the pair, also recorded in columns D-F.
- Column M (notna_in): The number of genomes in this clade that had this pair present on the same chromosome.
- Column N (notna_out): The number of genomes outside of the clade that had this pair present on the same chromosome.
- Column O (mean_in): The mean distance between the two loci in this pair in all of the genomes in this clade.
- Column P (sd_in): The standard deviation of the distances between these loci in all the genomes in the clade.
- Column Q (mean_out): The mean distance between the two loci in this pair in all of the genomes outside the clade.
- Column R (sd_out): The standard deviation of the distances between these loci in all the genomes outside the clade.
- Column S (occupancy_in): The fraction of genomes in this clade in the dataset that have this pair of loci on the same chromosome.
- Column T (occupancy_out): The fraction of genomes outside this clade that have this pair of loci on the same chromosome.
- Column U (num_species_in): The number of genomes in this clade in this dataset.
- Column V (num_species_out): The number of genomes outside of this clade in this dataset.
- Column W (sd_in_out_ratio): The ratio of standard deviations of distances inside this clade versus outside the clade.
- Column X (sd_in_out_ratio_log): The log of column W, sd_in_out_ratio.
- Column Y (mean_in_out_ratio): The ratio of the mean distances of the loci in genomes in the clade versus outside the clade.
- Column Z (mean_in_out_ratio_log): The log of column Y, mean_in_out_ratio.
- Column AA (sd_in_out_ratio_log_sigma): The z-score of column X when considered against the distribution of column X values for the same clade.
- Column AB (mean_in_out_ratio_log_sigma): The z-score of column Z when considered against the distribution of column Z values for the same clade.
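The flag definitions in columns G-J can be restated as a small function. This is a sketch of the thresholds as worded in the legend above, not the code used to build the table: the function name and the dictionary-based row are illustrative, and it follows the legend literally in applying no occupancy requirement to unstable_in_clade.

```python
def classify_pair(row, z_cutoff=2.0, min_occupancy=0.5):
    """Recompute the clade flags of Table S16 from one row's summary statistics.

    `row` is a mapping with the keys occupancy_in (column S),
    mean_in_out_ratio_log_sigma (column AB) and
    sd_in_out_ratio_log_sigma (column AA).
    """
    occupied = row["occupancy_in"] > min_occupancy
    mean_z = row["mean_in_out_ratio_log_sigma"]
    sd_z = row["sd_in_out_ratio_log_sigma"]
    return {
        "close_in_clade": int(occupied and mean_z < -z_cutoff),
        "distant_in_clade": int(occupied and mean_z > z_cutoff),
        "stable_in_clade": int(occupied and sd_z < -z_cutoff),
        # The legend states only the z-score condition for unstable pairs.
        "unstable_in_clade": int(sd_z > z_cutoff),
    }


# Made-up values for illustration only (not taken from the table).
example = {
    "occupancy_in": 0.8,
    "mean_in_out_ratio_log_sigma": -2.5,
    "sd_in_out_ratio_log_sigma": -3.1,
}
flags = classify_pair(example)  # close and stable, neither distant nor unstable
```

A pair can carry both the close and stable flags at once, which matches the observation that the most strongly stable pairs are also close pairs.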
Code/software
A spreadsheet viewer that can open .xlsx files is required to view this file. Options are Microsoft Excel (paid), Google Sheets (free but account required), Apple Numbers (included with macOS), or LibreOffice (free).
Part 2: BCnS Reciprocal Best Hits (RBH) Database
Directory: BCnSSimakov2022_current_rbh_202509/
Description
This directory contains the Reciprocal Best Hits (RBH) orthology database generated by the ODP (Oxford Dot Plot) pipeline. These files establish orthologous relationships between genes across all analyzed species and the BCnS (bilaterian-cnidarian-sponge; Simakov et al. 2022) ancestral linkage groups (ALGs).
Database Metrics:
- Total RBH files: 5,821
- Total size: 3.6 GB
- File format: Tab-separated values (TSV)
- Reference: ODP software https://github.com/conchoecia/odp
File Structure
File Naming Pattern:
BCnSSimakov2022_[SpeciesName]-[TaxID]-[Accession]_xy_reciprocal_best_hits.plotted.rbh
Examples:
BCnSSimakov2022_Abaxparallelepipedus-102642-GCA964197645.1_xy_reciprocal_best_hits.plotted.rbh
BCnSSimakov2022_Zygaenaviciae-287404-GCA964271875.1_xy_reciprocal_best_hits.plotted.rbh
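The naming pattern can be split back into its parts with a regular expression. The sketch below is illustrative rather than part of ODP: it assumes the TaxID is a run of digits and that the accession contains no hyphen, which holds for the two examples above but has not been checked against all 5,821 files.

```python
import re
from pathlib import Path

# [SpeciesName]-[TaxID]-[Accession], wrapped in the fixed prefix and suffix.
RBH_NAME = re.compile(
    r"^BCnSSimakov2022_(?P<species>.+)-(?P<taxid>\d+)-(?P<accession>[^-]+)"
    r"_xy_reciprocal_best_hits\.plotted\.rbh$"
)


def parse_rbh_filename(path):
    """Return (species, taxid, accession) parsed from an RBH file name."""
    name = Path(path).name  # accept full paths as well as bare file names
    match = RBH_NAME.match(name)
    if match is None:
        raise ValueError(f"not an RBH file name: {name}")
    return match["species"], int(match["taxid"]), match["accession"]


parsed = parse_rbh_filename(
    "BCnSSimakov2022_Zygaenaviciae-287404-GCA964271875.1"
    "_xy_reciprocal_best_hits.plotted.rbh"
)
# parsed == ("Zygaenaviciae", 287404, "GCA964271875.1")
```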
File Contents
Each RBH file contains reciprocal best hit orthology assignments between genes in a species genome and the BCnS ALG reference set. These files are used throughout the phylogenetic analysis workflow to:
- Identify ancestral linkage group (ALG) membership for genes
- Track ALG conservation and rearrangements across phylogeny
- Calculate dispersion and fusion metrics (Step 3)
- Generate perspective chromosome analyses (Step 4)
Usage in Workflow
These files are referenced in the analysis scripts via the placeholder <RBH_DATABASE_PATH>, which has been sanitized from the original path /lisc/data/scratch/molevo/dts/manifold/BCnSSimakov2022_current_rbh_202509/.
For detailed information about RBH file generation and format, see the ODP documentation: https://github.com/conchoecia/odp
Part 3: Comprehensive Phylogenetic Analysis of Chromosome Evolution
Directory: newick_and_timetree_20251118/
Overview
This directory contains a complete 9-step analytical workflow for phylogenetic analysis of chromosome evolution across 4,406 animal species. The analyses below focus on clades with 50 or more species. There were 141 clades that met this criterion. The workflow includes the generation of time-calibrated trees, ancestral linkage group (ALG) analyses, perspective chromosome analyses, and Fourier analysis of evolutionary patterns. All HPC paths have been sanitized to placeholders.
Summary Metrics:
- Total file count: ~17,000 files
- Total file size: ~9.7 GB
- Number of species analyzed: 4,406
- Number of analysis steps: 9
- Number of clades analyzed: 141
Finding Files for Your Clade of Interest
Quick Search Tips
To find results for your clade of interest, use file search or grep commands:
# Find all files mentioning your clade
find newick_and_timetree_20251118/ -name "*Lepidoptera*"
# Search for clade name in all files
grep -r "Mammalia" newick_and_timetree_20251118/
Per-Clade Analysis Locations
Detailed per-clade results are in:
newick_and_timetree_20251118/step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/
Events Over Time Analysis
Location: step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/
For each clade, per-clade analyses of ALG fusions and dispersals per million years. Files are named as: [CladeName]_[NCBI_TaxID]_[plottype].[ext]
Main analysis files (140 clades):
- *_changes_vs_intensity.pdf - Plots of clade-specific ALG fusion or dispersal rates versus species origination or extinction rates
- *_changes_vs_time.pdf - Clade-specific plots of ALG fusions or ALG dispersals over time
- *_event_count_conservation.pdf - Number of total events observed before and after phylogenetic weighting
- *_phylogenetic_weighting_verification.pdf - Effects of phylogenetic weighting on average rates
- *_phylogenetic_weights_diagnostic.pdf - Distribution of patristic-distance-summed phylogenetic weights for the clade
- *_temporal_heatmap.pdf - Heatmap of phylogenetic weight distribution at million-year time slices
- *_changes_vs_age_weighted.tsv - Weighted time-series data for changes
- *_changes_vs_age_unweighted.tsv - Unweighted time-series data for changes
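Because every per-clade file follows the [CladeName]_[NCBI_TaxID]_[plottype].[ext] pattern, a directory listing can be grouped by clade. The following is a minimal sketch under stated assumptions: the file names in the example are placeholders built from the pattern (not verified names from the archive), and the regular expression allows a leading minus sign because the custom clades in this dataset use negative identifiers such as -67.

```python
import re
from collections import defaultdict

PER_CLADE_NAME = re.compile(
    r"^(?P<clade>.+?)_(?P<taxid>-?\d+)_(?P<plottype>.+)\.(?P<ext>pdf|tsv)$"
)


def index_by_clade(filenames):
    """Group per-clade output files as {(clade, taxid): [plottype.ext, ...]}."""
    index = defaultdict(list)
    for name in filenames:
        match = PER_CLADE_NAME.match(name)
        if match is None:
            continue  # skip anything that does not follow the pattern
        key = (match["clade"], int(match["taxid"]))
        index[key].append(f'{match["plottype"]}.{match["ext"]}')
    return {key: sorted(values) for key, values in index.items()}


# Placeholder names for illustration only.
listing = [
    "ExampleClade_12345_changes_vs_time.pdf",
    "ExampleClade_12345_changes_vs_age_weighted.tsv",
    "Myriazoa_-67_temporal_heatmap.pdf",
    "notes.txt",
]
by_clade = index_by_clade(listing)
```

In practice the listing could come from os.listdir() on the per_clade_analyses/ directory.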
Fourier Analysis (subdirectory: fourier_analysis/)
Location: step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/fourier_analysis/
Detailed temporal pattern analysis for each clade.
For dispersal events (139 clades):
- [CladeName]_dispersals_padded.pdf - Time-series plot of ALG dispersion events with temporal padding
- [CladeName]_dispersals_padded_chunk_support.tsv - Statistical support for temporal windows (chunks)
- [CladeName]_dispersals_padded_rsims.pdf - Randomization simulation results visualized
- [CladeName]_dispersals_padded_rsims.tsv - Randomization simulation data (for significance testing)
For fusion events (121 clades with sufficient data):
- [CladeName]_fusions_padded.pdf - Time-series plot of ALG fusion events with temporal padding
- [CladeName]_fusions_padded_chunk_support.tsv - Statistical support for temporal windows
- [CladeName]_fusions_padded_rsims.pdf - Randomization simulation results visualized
- [CladeName]_fusions_padded_rsims.tsv - Randomization simulation data
Complete List of 141 Analyzed Clades
Invertebrates:
- Acalyptratae, Acanthomorphata, Annelida, Anthophila, Anthozoa
- Apocrita, Apoditrysia, Apoidea, Arachnida, Arthropoda
- Aschiza, Aculeata, Autobranchia, Bifurcata, Bilateria
- Brachycera, Calyptratae, Chelicerata, Chromadorea, Chrysomeloidea
- Cnidaria
- Coleoptera, Crambidae, Crustacea, Cucujiformia, Culicomorpha
- Cyclorrhapha, Demospongiae, Dicondylia, Diptera, Ditrysia
- Drosophila, Drosophilidae, Drosophilinae, Drosophilini, Ecdysozoa
- Endopterygota, Ennominae, Erebidae, Eremoneura, Euheterodonta
- Gastropoda
- Geometridae, Glossata, Hemiptera, Heteroconchia, Heteroneura
- Hexacorallia, Hexapoda, Hymenoptera, Ichneumonidae, Ichneumonoidea
- Imparidentia, Larentiinae, Lepidoptera, Lophotrochozoa, Mandibulata
- Metazoa, Mollusca, Muscomorpha, Myriazoa, Nematocera
- Nematoda, Neolepidoptera, Neoptera, Noctuidae, Noctuinae
- Noctuoidea, Nymphalidae, Obtectomera, Olethreutinae, Pancrustacea
- Papilionoidea, Paraneoptera, Polyneoptera, Polyphaga, Porifera
- Protostomia, Pterygota, Pyraloidea, Rhabditida, Rhabditidae
- Rhabditina, Rhabditomorpha, Satyrinae, Schizophora, Spiralia
- Syrphidae, Syrphoidea, Tortricidae, Unidentata
Fishes:
- Actinopteri, Actinopterygii, Clupeocephala, Ctenosquamata, Cypriniformes
- Cyprinoidei, Euacanthomorphacea, Eupercaria, Eurypterygia, Euteleosteomorpha
- Neopterygii, Ostariophysi, Osteoglossocephalai, Otomorpha, Otophysi
- Ovalentaria, Perciformes, Percomorphaceae, Teleostei
Tetrapods:
- Amniota, Amphiesmenoptera, Archelosauria, Archosauria, Artiodactyla
- Australaves, Aves, Boreoeutheria, Carnivora, Dipnotetrapodomorpha
- Episquamata, Euarchontoglires, Euteleostomi, Eutheria, Glires
- Laurasiatheria, Mammalia, Neoaves, Neognathae, Passeriformes
- Rodentia, Sauria, Telluraves, Tetrapoda, Theria
Deuterostomes:
- Chordata, Deuterostomia, Gnathostomata, Vertebrata
Two clades' phylogenetic positions were modified from the NCBI taxonomy tree (Myriazoa and Parahoxozoa), and these clades were assigned unique names and NCBI taxids.
Special Custom Analyses:
- Myriazoa_-67 (custom phylogeny with Ctenophora sister to other animals)
- Parahoxozoa_-68 (Parahoxozoa grouping)
Analysis Pipeline Overview
The workflow consists of 9 sequential steps:
- step1_generate_newick/ - Generate NCBI taxonomy tree
- step2_download_newick_from_timetree/ - TimeTree divergence time calibration
- step3_dispersal_characterization/ - ALG dispersion analysis
- step4_persp_chr/ - Perspective chromosome analysis (chromosome evolution tracking)
- step5_perspchangeplot/ - Change visualization
- step6_perspchrom_df_to_tree/ - Statistical tree mapping with simulations
- step7_tree_analysis_branchstatsvtime/ - Time-series analysis (contains per-clade results)
- step8_plot_tree_analysis/ - Tree visualizations
- step9_fourier/ - Fourier analysis of evolutionary patterns
- Note: Step9 scripts generate per-clade fourier analysis files, but these outputs are placed in the step7 directory structure at:
step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/fourier_analysis/
Key Output Files
Phylogenetic trees:
- step2.../20251130Tree.calibrated_tree.nwk - Time-calibrated phylogeny
Per-species results:
- step4.../per_species_ALG_presence_fusions.tsv - ALG presence and fusions per species
Per-clade results:
- step7.../branch_stats_output/per_clade_analyses/fourier_analysis/ - 141 clades analyzed
- step7.../branch_stats_output/major_clades_summary.tsv - Summary across all clades
Visualizations:
- step8.../tree*.pdf - Annotated phylogenetic trees
- step8.../matches_tree_composite_image.png - Composite visualization
Fourier analyses:
- step9.../clade_specific_results.txt - Periodic patterns detected (in step9 directory)
- Per-clade Fourier results - Generated by step9 scripts but located in: step7.../branch_stats_output/per_clade_analyses/fourier_analysis/
Important Notes
Path Sanitization
All HPC paths have been replaced with placeholders:
- <GENOME_DB_PATH> - Genome assembly database
- <ODP_SCRIPTS_PATH> - ODP software scripts
- <ODP_INSTALL_PATH> - ODP installation
- <RBH_DATABASE_PATH> - RBH database
- <ALG_RBH_FILE> - ALG RBH file
- <PYTHON_ENV> - Python environment paths
- <ANALYSIS_PATH> - Analysis directory paths
- <HPC_PATH> - General HPC paths
Excluded Files
For space efficiency, the following were excluded:
- .pkl cache files
- .err/.out SLURM logs
- changestring_checkpoints/ directory (thousands of files)
All final results and scripts are included.
Software Requirements
- Python 3.7+
- ODP (Oxford Dot Plot) software
- NumPy, Pandas, Matplotlib, SciPy
- BioPython, DendroPy
- UMAP, scikit-learn
Glossary
- ALG (Ancestral Linkage Group): Chromosome segment linked in ancestral species
- Dispersion: Fragmentation of ALGs across multiple chromosomes
- Perspective chromosome analysis: Phylogenetic method tracking chromosome changes
- RBH (Reciprocal Best Hit): Method for identifying orthologous genes
- Synteny: Conservation of gene order along chromosomes
- Time calibration: Assigning absolute ages to phylogenetic nodes
Contact
- Darrin T. Schultz
- ORCID: https://orcid.org/0000-0003-1190-1122
License
CC0 1.0 Universal (CC0 1.0) - https://creativecommons.org/publicdomain/zero/1.0/
Version History
- v20251118 (February 25, 2026): Phylogenetic analysis workflow
- 4,406 species, 141 clades analyzed
- 9-step pipeline with paths sanitized
- Original: DOI 10.5061/dryad.m905qfvbv - Topological mixing dataset
Complete Directory Tree and File Descriptions
This section provides a comprehensive listing of all directories and files in the newick_and_timetree_20251118/ workflow.
Directory Structure Overview
newick_and_timetree_20251118/
├── step1_generate_newick/ (6 files)
├── step2_download_newick_from_timetree/ (9 files)
├── step3_dispersal_characterization/ (2 files + 2 species subdirs)
│ ├── alg_dispersion_plots/ (2 files)
│ └── odp_pairwise_decay/ (2 species directories)
│ ├── Branchiostomafloridae-7739-GCF000003815.2/ (4,352 files)
│ └── Pectenmaximus-6579-GCF902652985.1/ (4,352 files)
├── step4_persp_chr/ (8 files)
├── step5_perspchangeplot/ (2 files)
├── step6_perspchrom_df_to_tree/ (124 files total)
│ └── simulations/ (100 files)
├── step7_tree_analysis_branchstatsvtime/ (2,220 files total)
│ └── branch_stats_output/ (3 subdirectories)
│ ├── custom_clade_analyses/ (2 custom clades)
│ ├── global_diagnostics/ (5 files)
│ └── per_clade_analyses/ (1,120 files)
│ └── fourier_analysis/ (845 files)
├── step8_plot_tree_analysis/ (14 files)
└── step9_fourier/ (14 files)
Step-by-Step File Descriptions
Step 1: Generate Newick Tree (step1_generate_newick/)
Purpose: Generate NCBI taxonomy tree for all species
Files:
- config.yaml - Configuration file with genome paths
- gen_tree.sh - Script to generate a phylogenetic tree from the NCBI taxonomy
- ncbi_tree.log - Log file from tree generation
- ncbi_tree.nwk - Newick format phylogenetic tree from NCBI taxonomy
- species_list.txt - List of all 4,406 species analyzed (species name, NCBI TaxID, assembly accession)
- subspecies_to_species_conversions.tsv - Mapping of subspecies to species-level taxonomy
Step 2: Download TimeTree Calibration (step2_download_newick_from_timetree/)
Purpose: Time-calibrate phylogeny using TimeTree database (http://www.timetree.org)
Files:
- 20251130Tree.calibrated_tree.nwk - Main output: Time-calibrated phylogenetic tree in Newick format
- 20251130Tree.divergence_times.txt - Pairwise divergence times between all taxa
- 20251130Tree.edge_information.tsv - Edge-level information (branch lengths, node IDs)
- 20251130Tree.node_ages_for_config.tsv - Node ages formatted for config files
- 20251130Tree.node_information.tsv - Node-level metadata (ages, support values)
- 20251130Tree.tree_report.txt - Summary report of tree calibration
- newick_timetree.nwk - Alternative Newick format output
- runlog.txt - Execution log
- run_newick_to_common_ancestors.sh - Execution script
Step 3: Dispersal Characterization (step3_dispersal_characterization/)
Purpose: Calculate ALG dispersion metrics and pairwise chromosomal decay for reference species
Root Files:
- config.yaml - Configuration file
- run_characterize_dispersion.sh - Execution script to run dispersion characterization of Branchiostoma and Pecten (Figure 2)
Subdirectory: alg_dispersion_plots/
Purpose: Conservation ranking and dispersion analysis across all species
Files (2):
- ALG_conservation_ranking_BCnSSimakov2022.tsv - Conservation ranking of all ALGs across the dataset
- ALG_dispersion_BCnSSimakov2022.pdf - Figure 2: Visualization of ALG dispersion patterns across phylogeny
Subdirectory: odp_pairwise_decay/
Purpose: Detailed pairwise chromosomal synteny decay for two reference species
Contains complete pairwise synteny decay analyses between reference species and all 4,348 other species in the dataset.
Directory: Branchiostomafloridae-7739-GCF000003815.2/ (lancelet, Branchiostoma floridae)
- decay_dataframes/ (4,348 TSV files)
  - Pattern: Branchiostomafloridae-7739-GCF000003815.2_vs_[Species]-[TaxID]-[Accession]_chromosomal_decay.tsv
  - Content: Pairwise synteny decay data between lancelet and each target species
  - Columns: Chromosome pairs, synteny scores, decay metrics
- plot_ALG_dispersion/ (1 PDF file)
  - Branchiostomafloridae-7739-GCF000003815.2_ALG_dispersion_by_conservation.pdf - ALG dispersion by conservation rank for lancelet
- plot_overview_sp_sp/ (1 PDF file)
  - Branchiostomafloridae-7739-GCF000003815.2_decay_plot_vs_divergence_time.pdf - Synteny decay vs. divergence time for all pairwise comparisons
Directory: Pectenmaximus-6579-GCF902652985.1/ (scallop, Pecten maximus)
- decay_dataframes/ (4,348 TSV files)
  - Pattern: Pectenmaximus-6579-GCF902652985.1_vs_[Species]-[TaxID]-[Accession]_chromosomal_decay.tsv
  - Content: Pairwise synteny decay data between the scallop and each target species
  - Columns: Chromosome pairs, synteny scores, decay metrics
- plot_ALG_dispersion/ (1 PDF file)
  - Pectenmaximus-6579-GCF902652985.1_ALG_dispersion_by_conservation.pdf - ALG dispersion by conservation rank for scallop
- plot_overview_sp_sp/ (1 PDF file)
  - Pectenmaximus-6579-GCF902652985.1_decay_plot_vs_divergence_time.pdf - Synteny decay vs. divergence time for all pairwise comparisons
Step 4: Per-species Chromosome Analysis (step4_persp_chr/)
Purpose: Track chromosome evolution across phylogeny (perspective chromosome method)
Files:
- locdf.tsv - Location dataframe with genomic coordinates
- missing_taxa_from_calibrated_tree.txt - Taxa in tree but missing from analysis
- other_ecdysozoa.txt - Non-standard ecdysozoan taxa
- output.log - Execution log
- perspchrom.tsv - Main output: Perspective chromosome assignments for all species
- per_species_ALG_presence_fusions.tsv - Per-species summary of ALG presence and fusion events
- run_persp_chr.sh - Execution script
- unique_changes_summary.tsv - Summary of unique chromosome change events
Step 5: Perspective Change Plot (step5_perspchangeplot/)
Purpose: Visualize the relationship between chromosome number and evolutionary changes
Files:
- chrom_number_vs_changes.pdf - Scatter plot of chromosome number vs. evolutionary change events. Used in Fig. 3.
- run_makeplot.sh - Plotting script
Step 6: Perspective Chromosome to Tree (step6_perspchrom_df_to_tree/)
Purpose: Map chromosome changes onto a phylogenetic tree with statistical simulations. Used in Fig. S1.
Root Files (24 total):
- run_persp2tree.sh - Execution script
- 14 PDF files: simulations_[TaxID]_[CladeName].pdf - Simulation results for major clades:
  - Acoela, Bilateria, Cnidaria, Ctenophora, Demospongiae, Deuterostomia
  - Ecdysozoa, Metazoa, Nematoda, Annelida, Porifera, Protostomia, Spiralia
Subdirectory: simulations/
Purpose: Statistical simulations to test the significance of chromosome change patterns
Files (100):
- 100 files: sim_results_[N]_10.tsv (where N = 0-99)
  - Each file contains 10 simulation replicates testing the null hypothesis of random chromosome changes
  - Used to calculate the statistical significance of observed patterns
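With 100 files of 10 replicates each, the simulations/ directory holds 1,000 replicates in total. A quick completeness check after extraction might look like the sketch below. It relies only on the file-name pattern given above; the function names are illustrative and the internal layout of each TSV is not assumed.

```python
from pathlib import Path

N_FILES = 100            # N runs from 0 to 99
REPLICATES_PER_FILE = 10  # as stated in this README


def expected_simulation_files():
    """File names implied by the pattern sim_results_[N]_10.tsv, N = 0-99."""
    return [f"sim_results_{n}_10.tsv" for n in range(N_FILES)]


def missing_simulation_files(sim_dir):
    """Return the expected simulation files that are absent from sim_dir."""
    present = {path.name for path in Path(sim_dir).glob("sim_results_*_10.tsv")}
    return [name for name in expected_simulation_files() if name not in present]


total_replicates = N_FILES * REPLICATES_PER_FILE  # 1000
```

An empty list from missing_simulation_files("step6_perspchrom_df_to_tree/simulations") indicates that all 100 files were extracted.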
Step 7: Tree Analysis - Branch Stats vs Time (step7_tree_analysis_branchstatsvtime/)
Purpose: Analyze chromosome evolution rates over time for all clades (≥50 species)
Root Files (5):
- diagnose_correction.sh - Diagnostic script for phylogenetic corrections
- extinction_intensity.tsv - Species extinction intensity estimates
- run_plotbranchstatstime.sh - Main execution script
- run_plotbranchstatstime_protost_minus_clitellata.sh - Custom analysis for Protostomia minus Clitellata
- run_plotbranchstatstime_verte_minus_teleost.sh - Custom analysis for Vertebrata minus Teleostei
Subdirectory: branch_stats_output/
Root Files:
- major_clades_summary.tsv - Summary statistics for major clades
- modified_edge_list.tsv - Edge list with chromosome change annotations
- modified_node_list.tsv - Node list with age and evolutionary rate data
Subdirectory: global_diagnostics/ (5 PDFs)
Global analyses across all clades:
- ALL_event_count_conservation.pdf - Conservation of event counts after phylogenetic weighting
- ALL_phylogenetic_weighting_verification.pdf - Verification of weighting methodology
- ALL_phylogenetic_weights_by_clade.pdf - Distribution of weights across all clades
- ALL_phylogenetic_weights_diagnostic.pdf - Diagnostic plots for weight calculations
- ALL_temporal_heatmap.pdf - Heatmap of evolutionary activity across all time periods
Subdirectory: custom_clade_analyses/ (2 custom clade directories)
Special analyses for clades with modified phylogenetic positions:
Directory: Protostomia_33317_minus_Clitellata_42113/
- 8 analysis files (same types as per_clade_analyses)
- fourier_analysis/ subdirectory with 6 files (3 dispersals, 3 fusions)
Directory: Vertebrata_7742_minus_Teleostei_32443/
- 8 analysis files (same types as per_clade_analyses)
- fourier_analysis/ subdirectory with 6 files (3 dispersals, 3 fusions)
Subdirectory: per_clade_analyses/ (1,120 files)
File Pattern: [CladeName]_[TaxID]_[analysis_type].[ext]
Per-clade analysis files (8 files × 140 clades = 1,120 files):
- *_changes_vs_intensity.pdf (140 PDFs)
  - Plots chromosome changes (fusions/dispersals) vs. species diversification/extinction rates
  - Tests the correlation between chromosome evolution and diversification
- *_changes_vs_time.pdf (140 PDFs)
  - Time series of chromosome change rates (fusions and dispersals per million years)
  - Shows temporal patterns of chromosome evolution
- *_event_count_conservation.pdf (140 PDFs)
  - Comparison of event counts before and after phylogenetic weighting
  - Validates phylogenetic correction methodology
- *_phylogenetic_weighting_verification.pdf (140 PDFs)
  - Verification that phylogenetic weighting removes branch length bias
  - Shows average rates across species with different amounts of data
- *_phylogenetic_weights_diagnostic.pdf (140 PDFs)
  - Distribution of phylogenetic weights across species in the clade
  - Diagnostic to identify species with unusual weighting
- *_temporal_heatmap.pdf (140 PDFs)
  - Heatmap showing distribution of phylogenetic weight across time slices
  - Identifies time periods with more/less data coverage
- *_changes_vs_age_weighted.tsv (140 TSV files)
  - Data file: Time-series data with phylogenetic weighting applied
  - Columns: time_mya, fusion_rate, dispersal_rate, num_species, total_weight
- *_changes_vs_age_unweighted.tsv (140 TSV files)
  - Data file: Raw time-series data without phylogenetic correction
  - Columns: time_mya, fusion_count, dispersal_count, num_branches
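The two TSV files can be read with the Python standard library. The sketch below assumes the files have a header row with the column names listed above and that every field is numeric with no missing values; real files may need extra handling. The function names and the inline example values are illustrative, not data from the archive.

```python
import csv
import io


def read_time_series(handle):
    """Read a tab-separated time series into a list of {column: float} rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{key: float(value) for key, value in record.items()} for record in reader]


def peak(rows, column):
    """Return (time_mya, value) for the row where `column` is largest."""
    best = max(rows, key=lambda row: row[column])
    return best["time_mya"], best[column]


# Made-up values in the layout of *_changes_vs_age_weighted.tsv.
demo = io.StringIO(
    "time_mya\tfusion_rate\tdispersal_rate\tnum_species\ttotal_weight\n"
    "0\t0.1\t0.2\t5\t1.0\n"
    "1\t0.4\t0.1\t5\t1.0\n"
)
rows = read_time_series(demo)
fusion_peak = peak(rows, "fusion_rate")  # (1.0, 0.4)
```

For a real file, open it with open(path, newline="") and pass the handle to read_time_series(); the same reader works for the unweighted file by asking peak() for fusion_count or dispersal_count.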
Subdirectory: per_clade_analyses/fourier_analysis/ (845 files)
Purpose: Detect periodic patterns in chromosome evolution using Fourier analysis
File Patterns:
Dispersal Analysis (141 clades × 4 files = 564 files):
- [CladeName]_dispersals_padded.pdf (141 PDFs)
  - Time-series plot of ALG dispersal events with temporal padding
  - Shows data, fitted trend, and detected periodicities
- [CladeName]_dispersals_padded_chunk_support.tsv (141 TSV files)
  - Statistical support for temporal windows (chunks)
  - Columns: chunk_ID, start_mya, end_mya, p_value, significance
- [CladeName]_dispersals_padded_rsims.pdf (141 PDFs)
  - Visualization of randomization test results
  - Shows observed vs. expected Fourier power spectrum
- [CladeName]_dispersals_padded_rsims.tsv (141 TSV files)
  - Randomization simulation data for significance testing
  - Columns: frequency, observed_power, mean_random_power, p_value
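One way to use the *_rsims.tsv columns is to list the frequencies whose observed power is unlikely under randomization and convert them to periods. This is a hedged sketch, not the authors' procedure: the 0.05 threshold is an arbitrary illustration, the rows are assumed to be dictionaries keyed by the column names above, and the frequency is assumed to be in cycles per million years (so that 1/frequency is a period in Myr), which this README does not state.

```python
def significant_periods(rows, alpha=0.05):
    """Return [(period, p_value), ...] for rows with p_value < alpha.

    Assumptions: each row has 'frequency' and 'p_value' fields, and
    frequency is in cycles per Myr. Zero frequency (the constant term)
    is skipped because it has no finite period.
    """
    hits = []
    for row in rows:
        frequency = float(row["frequency"])
        p_value = float(row["p_value"])
        if frequency > 0 and p_value < alpha:
            hits.append((1.0 / frequency, p_value))
    return sorted(hits, key=lambda item: item[1])  # most significant first


# Made-up rows for illustration only.
demo_rows = [
    {"frequency": "0", "observed_power": "9.0", "mean_random_power": "8.0", "p_value": "0.01"},
    {"frequency": "0.25", "observed_power": "3.0", "mean_random_power": "1.0", "p_value": "0.04"},
    {"frequency": "0.5", "observed_power": "5.0", "mean_random_power": "1.0", "p_value": "0.001"},
    {"frequency": "1.0", "observed_power": "1.1", "mean_random_power": "1.0", "p_value": "0.6"},
]
periods = significant_periods(demo_rows)
```

Because many frequencies are tested per clade, any threshold applied this way would normally need a multiple-testing correction before interpretation.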
Fusion Analysis (121 clades × 4 files = 484 files; fewer clades due to insufficient fusion events):
- [CladeName]_fusions_padded.pdf (121 PDFs)
  - Time-series plot of ALG fusion events with temporal padding
- [CladeName]_fusions_padded_chunk_support.tsv (121 TSV files)
  - Statistical support for temporal windows in fusion events
- [CladeName]_fusions_padded_rsims.pdf (121 PDFs)
  - Randomization test visualization for fusion events
- [CladeName]_fusions_padded_rsims.tsv (121 TSV files)
  - Randomization simulation data for fusion events
Note: Some clades have insufficient fusion events for reliable Fourier analysis, resulting in 121 clades with fusion analysis vs. 141 with dispersal analysis.
Step 8: Plot Tree Analysis (step8_plot_tree_analysis/)
Purpose: Generate annotated phylogenetic tree visualizations
Files (14):
- run_plotannotatedtree.sh - Plotting script
- tree.pdf - Base phylogenetic tree visualization
Bivariate visualizations (chromosome number changes):
- tree_2d_bivariate.pdf - Bivariate tree with fusion/dispersal rates
- tree_2d_bivariate_balanced_purple.pdf - Color scheme variant
- tree_2d_bivariate_lavender.pdf - Color scheme variant
- tree_2d_bivariate_overlay.pdf - Overlay visualization
- tree_2d_bivariate_softer_purple.pdf - Color scheme variant
- tree_2d_bivariate_true_magenta.pdf - Color scheme variant
- tree_2d_bivariate_violet.pdf - Color scheme variant
Diagnostic visualizations:
- tree_diagnostic_dispersals_overlay.pdf - Dispersal events mapped to the tree
- tree_diagnostic_fusions_overlay.pdf - Fusion events mapped to the tree
- tree_split_side.pdf - Split-view tree visualization
Composite images:
- matches_tree_composite_image.png - Clustered composite of a tree with chromosome evolution patterns
- matches_tree_composite_image_unclustered.png - Unclustered composite
Step 9: Fourier Analysis (step9_fourier/)
Purpose: Control scripts for running Fourier analysis on the HPC cluster
Files (14):
- branches_per_age.pdf - Diagnostic: number of phylogenetic branches per time slice
- clade_specific_results.txt - Summary: Detected periodic patterns for all clades
- count_branches.py - Python script to count branches per time window
- fourier_files.txt - List of input files for Fourier analysis
- support_vs_time_window.pdf - Diagnostic: statistical power vs. time window size
- vertebrate_minus_teleost_time.py - Custom script for Vertebrata minus Teleostei analysis
Execution scripts (generate files in step7/fourier_analysis/):
- run_fourier.sh - Main Fourier analysis script
- run_fourier2.sh - Alternative execution script
- run_fourier_array.sh - Array job script for parallel execution
- submit_fourier_array.sh - SLURM submission script
- run_protostomes_minus_clitellata.sh - Custom analysis for Protostomia minus Clitellata
- run_vertebrates.sh - Vertebrate-specific analysis
- run_vertebrates350.sh - Vertebrate analysis with 350 Myr window
- run_vertebrates_loop.sh - Vertebrate analysis in loop
Note: Step 9 scripts generate per-clade Fourier analysis outputs, but these files are placed in:
step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/fourier_analysis/
File Count Summary
| Component | Description | File Count |
|---|---|---|
| RBH Database | BCnS RBH Files | 5,821 |
| Workflow | Phylogenetic Analysis Steps | 11,101 |
| Step 1 | Generate Newick | 6 |
| Step 2 | TimeTree Calibration | 9 |
| Step 3 | Dispersal Characterization | 8,704 |
| Step 4 | Perspective Chromosome | 8 |
| Step 5 | Change Plot | 2 |
| Step 6 | Tree Mapping + Simulations | 124 |
| Step 7 | Branch Stats vs Time | 2,220 |
| Step 8 | Tree Visualizations | 14 |
| Step 9 | Fourier Control Scripts | 14 |
| Total | RBH Database + Workflow | ~16,922 |
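The totals in this table can be cross-checked from the per-step rows. The short sketch below only re-adds the counts listed above; the variable names are illustrative.

```python
# Per-step file counts copied from the table above.
step_counts = {
    "step1": 6,
    "step2": 9,
    "step3": 8704,
    "step4": 8,
    "step5": 2,
    "step6": 124,
    "step7": 2220,
    "step8": 14,
    "step9": 14,
}
rbh_files = 5821

workflow_total = sum(step_counts.values())  # 11101, the Workflow row
grand_total = workflow_total + rbh_files    # 16922, the Total row
```

The nine steps sum to the 11,101 workflow files, and adding the 5,821 RBH files gives the 16,922 reported in the Total row.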
Key Figure References
Main Figures in Paper:
- Figure 2: step3_dispersal_characterization/alg_dispersion_plots/ALG_dispersion_BCnSSimakov2022.pdf
  - ALG dispersion patterns across phylogeny showing conservation ranking
Supplementary Figures:
- Synteny decay: step3_dispersal_characterization/odp_pairwise_decay/[Species]/plot_overview_sp_sp/*.pdf
- Per-clade time series: step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/*_changes_vs_time.pdf
- Fourier analysis: step7_tree_analysis_branchstatsvtime/branch_stats_output/per_clade_analyses/fourier_analysis/*.pdf
- Annotated trees: step8_plot_tree_analysis/tree*.pdf
