Supplementary data from: Integrating secondary structure information enhances phylogenetic signal in mitochondrial protein coding genes

Cucini, Claudio1; Nardi, Francesco1; Pons, Joan 2

Research facility: Mediterranean Institute for Advanced Studies

Published Mar 19, 2026 on Dryad. https://doi.org/10.5061/dryad.wh70rxx29

Data files

Mar 19, 2026 version files 6.01 MB

README.md (6.57 KB)
Suppl_Data_S1.zip (2.77 MB)
Suppl_Fig_S1.jpg (146.34 KB)
Suppl_Fig_S2.jpg (1.68 MB)
Suppl_Fig_S3.jpg (449.12 KB)
Suppl_Fig_S4.jpg (943.55 KB)
Suppl_Table_S1.xlsx (14.84 KB)
Suppl_Table_S2.csv (937 B)

Abstract

Accurate phylogenetic inference requires models that account for heterogeneity in molecular evolution. Mitochondrial protein-coding genes, which encode membrane-bound proteins composed of multiple transmembrane α-helices, exhibit considerable compositional and functional variation across structural regions, variation that is often overlooked in standard partitioning strategies. Here, we introduce TRAMPO (TRAnsMembrane Protein Order), a novel pipeline that incorporates predicted secondary structural features (i.e., matrix-facing, transmembrane, and intermembrane-facing domains) into phylogenetic partitioning schemes. We applied TRAMPO to seven mitochondrial datasets, spanning crustaceans, hexapods, and vertebrates, and evaluated eight partitioning strategies based on combinations of codon position, strand, and secondary structure. Transmembrane helices showed pronounced thymine enrichment at second codon positions and hydrophobic amino-acid composition, reflecting domain-specific evolutionary constraints. To assess whether these structural patterns influence phylogenetic reconstruction, we performed maximum likelihood analyses under Markov models with various degrees of complexity (ranging from standard Markov models, via Lie Markov and General Heterogeneous evolution on a Single Topology Markov models, to profile mixture Markov models). We also evaluated different models of rate-heterogeneity across sites (including the invariable sites model, gamma-distribution model, and FreeRate model) to examine their interaction with partitioning strategies and overall model performance. Incorporating structural information into partitioning schemes consistently improved model fit and reduced apparent heterogeneity, as reflected in lower AIC values and more compositionally homogeneous partitions. These improvements translated into more consistent and topologically congruent phylogenetic trees across most datasets, while also reducing computational time. 
Notably, second codon positions in DNA that encode transmembrane helices were consistently retained as distinct partitions during model optimization, even in Mammals and Vertebrates, where secondary structure contributed little to overall model performance, underscoring their strong and conserved evolutionary signal. Surveys of tree space using quartet distances further supported these findings, with structurally informed models yielding more tightly clustered and internally consistent tree topologies. The benefits of structural partitioning were most pronounced in lineages of intermediate evolutionary depth and declined in ancient vertebrate and mammalian clades, where substitutional saturation accumulates with evolutionary time and strand asymmetry tends to emerge more frequently. In some cases, models with the lowest AIC did not yield the most congruent topologies, underscoring the limitations of information criteria when comparing models of different complexity. Overall, our findings demonstrate that secondary structural features, particularly the repetitive architecture of transmembrane helices, harbour meaningful phylogenetic signal. Incorporating this information into partitioning schemes improves tree reconstruction and mitigates underlying heterogeneity. TRAMPO provides a scalable, open-source tool to implement this approach in mitochondrial phylogenetics.

Dataset DOI: 10.5061/dryad.wh70rxx29

Description of the data and file structure

Supplementary Figure S1. Diagram illustrating the eight partitioning schemes defined by the TRAMPO pipeline, which were subsequently used in maximum likelihood (ML) analyses.

Supplementary Figure S2. Scatter plots of A-skew (x-axis) versus G-skew (y-axis) across two strand orientations (positive and negative), three structural domains (IM, intermembrane-facing; MA, matrix-facing; TM, transmembrane), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as strand_domain_codon position. Strand differences are not shown for Mammals, Primates, and Vertebrates because only a single gene is encoded on the negative strand in these lineages.

Supplementary Figure S3. Box plots of G+C frequency across two strand orientations (positive and negative), three structural domains (IM, MA, TM), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are named as strand_domain, and the three codon positions are shown in separate panels. Strand differences are not shown for Mammals, Primates, and Vertebrates because only a single gene is encoded on the negative strand in these lineages.

Supplementary Figure S4. Frequency of amino acids grouped into six chemical classes across two strand orientations (positive and negative) and three structural domains (IM, MA, TM) for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as domain_strand. Strand differences are not shown for Mammals, Primates, and Vertebrates because only a single gene is encoded on the negative strand in these lineages. The six chemical classes are: G1 Small (A, G, P, S, T); G2 Acidic/Amide (D, E, N, Q); G3 Basic (H, K, R); G4 Hydrophobic (I, L, M, V); G5 Aromatic (F, W, Y); and G6 Sulfur-containing (C).

Supplementary Table S1. Species names and accession numbers of the mitochondrial genomes for the taxa included in the seven datasets analysed in this study.

Supplementary Table S2. UniProt-SwissProt accession numbers of mitochondrial protein sequences from the seven representative model organisms used as references in the TRAMPO pipeline.

Supplementary Data S1 (Suppl_Data_S1.zip). Multiple sequence alignments in FASTA format and partition schemes in NEXUS format. The NEXUS files include charset definitions corresponding to multiple partitioning schemes derived from codon position, strand, and domain information for the seven lineages studied: Collembola, Hyalella, Mammals, Metacrangonyx, Primates, Pseudoniphargus, and Vertebrates. See the TRAMPO GitHub repository for file naming conventions (https://github.com/dbajpp0/TRAMPO). The archive also contains Python 3 scripts to analyze the distribution of site probability categories (GHOST p1–p10) across different genomic regions defined by strand, structural domain, and codon position.
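As a rough illustration of how the charset definitions in the NEXUS partition files might be read, the following Python sketch extracts charset names and site ranges from NEXUS text. It assumes the common `charset NAME = start-end\step ...;` syntax; the charset names (`pos_TM_2`, `neg_MA_1`) and coordinates in the example are hypothetical and are not taken from the archive, and the functions `parse_charsets` and `charset_length` are not part of TRAMPO.

```python
import re

def parse_charsets(nexus_text):
    """Return {name: [(start, end, step), ...]} for each 'charset' statement."""
    charsets = {}
    # Matches statements such as: charset NAME = 2-300\3 400-500\3;
    pattern = re.compile(r"charset\s+(\S+)\s*=\s*([^;]+);", re.IGNORECASE)
    for name, spec in pattern.findall(nexus_text):
        ranges = []
        for token in spec.split():
            # A token is 'N', 'N-M', or 'N-M\S' (1-based, inclusive, step S).
            m = re.fullmatch(r"(\d+)(?:-(\d+))?(?:\\(\d+))?", token)
            if m is None:
                continue  # skip tokens this sketch does not understand
            start = int(m.group(1))
            end = int(m.group(2)) if m.group(2) else start
            step = int(m.group(3)) if m.group(3) else 1
            ranges.append((start, end, step))
        charsets[name] = ranges
    return charsets

def charset_length(ranges):
    """Number of alignment sites covered by a list of (start, end, step)."""
    return sum(len(range(start, end + 1, step)) for start, end, step in ranges)

# Hypothetical example text; names and coordinates are illustrative only.
example = r"""
#nexus
begin sets;
  charset pos_TM_2 = 2-10\3 20-28\3;
  charset neg_MA_1 = 31-45\3;
end;
"""
sets = parse_charsets(example)
for name, ranges in sets.items():
    print(name, ranges, charset_length(ranges))
```

The sketch does not handle ranges written with spaces around the hyphen or charsets that reference other charsets by name; a dedicated NEXUS parser would be the safer choice for anything beyond a quick inspection.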

File: Suppl_Fig_S1.jpg

Description: Diagram illustrating the eight partitioning schemes defined by the TRAMPO pipeline, which were subsequently used in maximum likelihood (ML) analyses.

File: Suppl_Table_S1.xlsx

Description: Species names and accession numbers of the mitochondrial genomes for the taxa included in the seven datasets analysed in this study.

Variables

Species names and accession numbers

File: Suppl_Fig_S2.jpg

Description: Scatter plots of A-skew (x-axis) versus G-skew (y-axis) across two strand orientations (positive and negative), three structural domains (IM, MA, TM), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as strand_domain_codon position. Strand differences are not shown for Mammals, Primates, and Vertebrates because only a single gene is encoded on the negative strand in these lineages.
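For readers who want to reproduce skew values from the alignments, here is a minimal Python sketch. It assumes the conventional definitions A-skew = (A - T) / (A + T) and G-skew = (G - C) / (G + C); this README does not state the formulas, so these definitions and the function names are assumptions rather than the authors' implementation.

```python
from collections import Counter

def skews(seq):
    """Return (A-skew, G-skew) for a nucleotide string.

    Assumed definitions: (A - T) / (A + T) and (G - C) / (G + C).
    Gaps and ambiguity codes are ignored; an empty denominator gives 0.0.
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    a_skew = (a - t) / (a + t) if (a + t) else 0.0
    g_skew = (g - c) / (g + c) if (g + c) else 0.0
    return a_skew, g_skew

def skews_by_codon_position(cds):
    """Split an in-frame coding sequence by codon position (1, 2, 3)."""
    return {pos + 1: skews(cds[pos::3]) for pos in range(3)}

print(skews("AAATGGGC"))
print(skews_by_codon_position("ATGATGATG"))
```

To obtain one point per partition, the same function could be applied to the columns selected by each strand_domain_codon position charset.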

File: Suppl_Table_S2.csv

Description: UniProt-SwissProt accession numbers of mitochondrial protein sequences from the seven representative model organisms used as references in the TRAMPO pipeline.

Variables

Gene names, species and taxonomic rank

File: Suppl_Fig_S3.jpg

Description: Box plots of G+C frequency across two strand orientations (positive and negative), three structural domains (IM, MA, TM), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are named as strand_domain, and the three codon positions are shown in separate panels. Strand differences are not shown for Mammals, Primates, and Vertebrates because only a single gene is encoded on the negative strand in these lineages.
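The per-taxon values that such box plots summarize could be computed along these lines. This is a sketch under stated assumptions: the alignment is taken to be an in-frame coding alignment held as a dictionary of taxon name to sequence, gaps and ambiguity codes are excluded from the denominator, and the taxon names and sequences are invented for illustration.

```python
from statistics import median

def gc_fraction(seq):
    """G+C fraction over unambiguous bases; None if no A/C/G/T is present."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt

def gc_by_codon_position(alignment):
    """alignment: {taxon: in-frame coding sequence}.

    Returns {1: [...], 2: [...], 3: [...]} with one G+C value per taxon,
    i.e. the distribution that a box plot would summarize.
    """
    result = {1: [], 2: [], 3: []}
    for seq in alignment.values():
        for pos in (1, 2, 3):
            value = gc_fraction(seq[pos - 1::3])
            if value is not None:
                result[pos].append(value)
    return result

# Hypothetical toy alignment, not taken from the dataset.
toy = {"taxon_A": "ATGGCC---", "taxon_B": "ATAGCTGGN"}
gc = gc_by_codon_position(toy)
for pos, values in gc.items():
    print(pos, values, median(values))
```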

File: Suppl_Fig_S4.jpg

Description: Frequency of amino acids grouped into six chemical classes across two strand orientations (positive and negative) and three structural domains (IM, MA, TM) for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as domain_strand. Strand differences are not shown for Mammals, Primates, and Vertebrates because only a single gene is encoded on the negative strand in these lineages. The six chemical classes are: G1 Small (A, G, P, S, T); G2 Acidic/Amide (D, E, N, Q); G3 Basic (H, K, R); G4 Hydrophobic (I, L, M, V); G5 Aromatic (F, W, Y); and G6 Sulfur-containing (C).
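The six-class grouping listed above can be encoded directly as a lookup table. The Python sketch below uses the class memberships exactly as given in the description; the function name, the handling of gaps and unknown residues (they are skipped), and the example peptide are assumptions made for illustration.

```python
from collections import Counter

# Class memberships as listed in the figure description.
CLASSES = {
    "G1 Small": "AGPST",
    "G2 Acidic/Amide": "DENQ",
    "G3 Basic": "HKR",
    "G4 Hydrophobic": "ILMV",
    "G5 Aromatic": "FWY",
    "G6 Sulfur-containing": "C",
}
RESIDUE_TO_CLASS = {
    aa: name for name, residues in CLASSES.items() for aa in residues
}

def class_frequencies(protein):
    """Relative frequency of each chemical class in a protein string.

    Gaps and residues outside the 20 standard amino acids are skipped.
    """
    counts = Counter(
        RESIDUE_TO_CLASS[aa] for aa in protein.upper() if aa in RESIDUE_TO_CLASS
    )
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in CLASSES}

# Hypothetical peptide with one gap and one unknown residue.
freqs = class_frequencies("LLIVFWAC-X")
print(freqs)
```

Applying the function to the residues of each domain_strand partition would give one set of six frequencies per partition and taxon.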

Code/software

OpenOffice, LibreOffice, or similar software for the spreadsheet files (.xlsx, .csv), and Firefox or any other image viewer for the .jpg figures.

Access information

Other publicly accessible locations of the data:

None

Data was derived from the following sources:

https://www.biorxiv.org/content/10.1101/2024.08.01.606191v1
