Supplementary data from: Integrating secondary structure information enhances phylogenetic signal in mitochondrial protein coding genes
Data files
Mar 19, 2026 version files 6.01 MB
- README.md (6.57 KB)
- Suppl_Data_S1.zip (2.77 MB)
- Suppl_Fig_S1.jpg (146.34 KB)
- Suppl_Fig_S2.jpg (1.68 MB)
- Suppl_Fig_S3.jpg (449.12 KB)
- Suppl_Fig_S4.jpg (943.55 KB)
- Suppl_Table_S1.xlsx (14.84 KB)
- Suppl_Table_S2.csv (937 B)
Abstract
Accurate phylogenetic inference requires models that account for heterogeneity in molecular evolution. Mitochondrial protein-coding genes, which encode membrane-bound proteins composed of multiple transmembrane α-helices, exhibit considerable compositional and functional variation across structural regions, variation that is often overlooked in standard partitioning strategies. Here, we introduce TRAMPO (TRAnsMembrane Protein Order), a novel pipeline that incorporates predicted secondary structural features (i.e., matrix-facing, transmembrane, and intermembrane-facing domains) into phylogenetic partitioning schemes. We applied TRAMPO to seven mitochondrial datasets, spanning crustaceans, hexapods, and vertebrates, and evaluated eight partitioning strategies based on combinations of codon position, strand, and secondary structure. Transmembrane helices showed pronounced thymine enrichment at second codon positions and hydrophobic amino-acid composition, reflecting domain-specific evolutionary constraints. To assess whether these structural patterns influence phylogenetic reconstruction, we performed maximum likelihood analyses under Markov models with various degrees of complexity (ranging from standard Markov models, via Lie Markov and General Heterogeneous evolution on a Single Topology Markov models, to profile mixture Markov models). We also evaluated different models of rate-heterogeneity across sites (including the invariable sites model, gamma-distribution model, and FreeRate model) to examine their interaction with partitioning strategies and overall model performance. Incorporating structural information into partitioning schemes consistently improved model fit and reduced apparent heterogeneity, as reflected in lower AIC values and more compositionally homogeneous partitions. These improvements translated into more consistent and topologically congruent phylogenetic trees across most datasets, while also reducing computational time. 
Notably, second codon positions in DNA that encode transmembrane helices were consistently retained as distinct partitions during model optimization, even in Mammals and Vertebrates, where secondary structure contributed little to overall model performance, underscoring their strong and conserved evolutionary signal. Surveys of tree space using quartet distances further supported these findings, with structurally informed models yielding more tightly clustered and internally consistent tree topologies. The benefits of structural partitioning were most pronounced in lineages of intermediate evolutionary depth and declined in ancient vertebrate and mammalian clades, where substitutional saturation accumulates with evolutionary time and strand asymmetry tends to emerge more frequently. In some cases, models with the lowest AIC did not yield the most congruent topologies, underscoring the limitations of information criteria when comparing models of different complexity. Overall, our findings demonstrate that secondary structural features, particularly the repetitive architecture of transmembrane helices, harbour meaningful phylogenetic signal. Incorporating this information into partitioning schemes improves tree reconstruction and mitigates underlying heterogeneity. TRAMPO provides a scalable, open-source tool to implement this approach in mitochondrial phylogenetics.
Dataset DOI: 10.5061/dryad.wh70rxx29
Description of the data and file structure
Supplementary Figure S1. Diagram illustrating the eight partitioning schemes defined by the TRAMPO pipeline, which were subsequently used in maximum likelihood (ML) analyses.
Supplementary Figure S2. Scatter plots of A-skew (x-axis) versus G-skew (y-axis) across strand orientations (positive and negative), three transmembrane domains (IM, intermembrane-facing; MA, matrix-facing; TM, transmembrane), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as strand_domain_codon position. Strand differences are not shown for Mammals, Primates, and Vertebrates due to the presence of only a single gene encoded on the negative strand in these lineages.
Supplementary Figure S3. Box plots of G+C frequency across two strand orientations (positive and negative), three transmembrane domains (IM, MA, TM), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are named as strand_domain, and the three codon positions are shown in separate panels. Strand differences are not shown for Mammals, Primates, and Vertebrates due to the presence of only a single gene encoded on the negative strand in these lineages.
Supplementary Figure S4. Frequency of amino acids grouped into six chemical classes across two strand orientations (positive and negative) and three transmembrane domains (IM, MA, TM) for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as domain_strand. Strand differences are not shown for Mammals, Primates, and Vertebrates due to the presence of only a single gene encoded on the negative strand in these lineages. The six chemical classes are: G1 Small (A, G, P, S, T); G2 Acidic/Amide (D, E, N, Q); G3 Basic (H, K, R); G4 Hydrophobic (I, L, M, V); G5 Aromatic (F, W, Y); and G6 Sulfur-containing (C).
Supplementary Table S1. Species names and accession numbers of the mitochondrial genomes for the taxa included in the seven datasets analysed in this study.
Supplementary Table S2. UniProt-SwissProt accession numbers of mitochondrial protein sequences from the seven representative model organisms used as references in the TRAMPO pipeline.
Supplementary Data S1 (Suppl_Data_S1.zip). Multiple sequence alignments in FASTA format and partitioning schemes in NEXUS format. The NEXUS files include charset definitions corresponding to multiple partitioning schemes derived from codon position, strand, and domain information for the seven lineages studied: Collembola, Hyalella, Mammals, Metacrangonyx, Primates, Pseudoniphargus, and Vertebrates. See the TRAMPO GitHub repository for file naming conventions (https://github.com/dbajpp0/TRAMPO). The archive also contains Python 3 scripts to analyze the distribution of site probability categories (GHOST p1–p10) across different genomic regions defined by strand, structural domain, and codon position.
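As a rough illustration of how the charset definitions in the NEXUS partition files might be read, the following Python 3 sketch expands each charset into a list of alignment columns. It is not part of the TRAMPO pipeline or of the scripts in the archive. It assumes that each charset is written on a single line using plain numeric ranges with an optional `\step` suffix (for example, every third site for a codon position); the charset names in the example are hypothetical and do not follow the actual TRAMPO file naming.

```python
import re

# Assumption: one charset per line, e.g. "charset name = 2-300\3 400-500;".
CHARSET_RE = re.compile(r"^\s*charset\s+(\S+)\s*=\s*(.+?);",
                        re.IGNORECASE | re.MULTILINE)
# A site, a range "start-end", or a strided range "start-end\step".
RANGE_RE = re.compile(r"(\d+)(?:\s*-\s*(\d+))?(?:\s*\\\s*(\d+))?")


def parse_charsets(nexus_text):
    """Return {charset name: sorted list of 1-based alignment columns}."""
    charsets = {}
    for name, spec in CHARSET_RE.findall(nexus_text):
        sites = set()
        for start, end, step in RANGE_RE.findall(spec):
            first = int(start)
            last = int(end) if end else first      # single site if no end given
            stride = int(step) if step else 1      # "\3" selects every third site
            sites.update(range(first, last + 1, stride))
        charsets[name] = sorted(sites)
    return charsets


# Hypothetical partition block; names and coordinates are invented for illustration.
EXAMPLE = r"""#nexus
begin sets;
    charset pos_TM_2 = 2-11\3;
    charset neg_IM_1 = 13-15 20;
end;
"""

print(parse_charsets(EXAMPLE))
# {'pos_TM_2': [2, 5, 8, 11], 'neg_IM_1': [13, 14, 15, 20]}
```

Charset specifications that carry extra tokens (such as a data type or an alignment file name before the ranges) would need additional handling before the ranges are parsed.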
File: Suppl_Fig_S1.jpg
Description: Diagram illustrating the eight partitioning schemes defined by the TRAMPO pipeline, which were subsequently used in maximum likelihood (ML) analyses.
File: Suppl_Table_S1.xlsx
Description: Species names and accession numbers of the mitochondrial genomes for the taxa included in the seven datasets analysed in this study.
Variables
- Species names and accession numbers
File: Suppl_Fig_S2.jpg
Description: Scatter plots of A-skew (x-axis) versus G-skew (y-axis) across strand orientations (positive and negative), three transmembrane domains (IM, MA, TM), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as strand_domain_codon position. Strand differences are not shown for Mammals, Primates, and Vertebrates due to the presence of only a single gene encoded on the negative strand in these lineages.
File: Suppl_Table_S2.csv
Description: UniProt-SwissProt accession numbers of mitochondrial protein sequences from the seven representative model organisms used as references in the TRAMPO pipeline.
Variables
- Gene names, species and taxonomic rank
File: Suppl_Fig_S3.jpg
Description: Box plots of G+C frequency across two strand orientations (positive and negative), three transmembrane domains (IM, MA, TM), and three codon positions for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are named as strand_domain, and the three codon positions are shown in separate panels. Strand differences are not shown for Mammals, Primates, and Vertebrates due to the presence of only a single gene encoded on the negative strand in these lineages.
File: Suppl_Fig_S4.jpg
Description: Frequency of amino acids grouped into six chemical classes across two strand orientations (positive and negative) and three transmembrane domains (IM, MA, TM) for the seven lineages analysed in this study: Collembola, Hyalella, Metacrangonyx, Pseudoniphargus, Mammals, Primates, and Vertebrates (shown on pages 1 to 7, respectively). Partitions are labelled as domain_strand. Strand differences are not shown for Mammals, Primates, and Vertebrates due to the presence of only a single gene encoded on the negative strand in these lineages. The six chemical classes are: G1 Small (A, G, P, S, T); G2 Acidic/Amide (D, E, N, Q); G3 Basic (H, K, R); G4 Hydrophobic (I, L, M, V); G5 Aromatic (F, W, Y); and G6 Sulfur-containing (C).
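The composition statistics summarised in Supplementary Figures S2 to S4 can be approximated with a short Python 3 sketch. This is an illustration only, not the code used to produce the figures. It assumes the conventional skew definitions, A-skew = (A − T)/(A + T) and G-skew = (G − C)/(G + C), which are not stated explicitly in this description; the six amino-acid classes are taken from the Figure S4 legend. Function names are hypothetical.

```python
from collections import Counter

# Six chemical classes as listed in the legend of Supplementary Figure S4.
AA_CLASSES = {
    "G1 Small": "AGPST",
    "G2 Acidic/Amide": "DENQ",
    "G3 Basic": "HKR",
    "G4 Hydrophobic": "ILMV",
    "G5 Aromatic": "FWY",
    "G6 Sulfur-containing": "C",
}


def nucleotide_stats(dna):
    """A-skew, G-skew, and G+C frequency of one partition (gaps/ambiguities ignored)."""
    counts = Counter(dna.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    total = a + c + g + t
    return {
        # Assumed conventional definitions of the skews.
        "A_skew": (a - t) / (a + t) if a + t else 0.0,
        "G_skew": (g - c) / (g + c) if g + c else 0.0,
        "GC": (g + c) / total if total else 0.0,
    }


def aa_class_frequencies(protein):
    """Relative frequency of each chemical class among the 20 standard residues."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for members in AA_CLASSES.values() for aa in members)
    return {
        name: (sum(counts[aa] for aa in members) / total if total else 0.0)
        for name, members in AA_CLASSES.items()
    }


# Toy sequences, invented for illustration.
print(nucleotide_stats("AAATGGCC"))
print(aa_class_frequencies("LLFC"))
```

In practice, the input for each call would be the concatenated columns of one partition (for example, one strand, domain, and codon position combination) for a given taxon.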
Code/software
The spreadsheet files (.xlsx and .csv) can be opened with OpenOffice, LibreOffice, or similar software. The figures (.jpg) can be viewed in Firefox or any other image viewer.
Access information
Other publicly accessible locations of the data:
- None
Data was derived from the following sources:
