Supplementary data for: DNA sequences are as useful as protein sequences for inferring deep phylogenies

Kapli, Paschalia; Kotari, Ioanna; Telford, Maximilian J; Goldman, Nick; Yang, Ziheng

Published Jun 28, 2023 on Dryad. https://doi.org/10.5061/dryad.sbcc2fr85

Data files

Jun 28, 2023 version files, 6.33 MB:

example_control_file_BSH.txt (32.21 KB)
example_control_file_SH2.txt (11.44 KB)
HOMO-control.txt (2.67 KB)
metazoa_alignment.CODONS.clean2.fasta.tar.gz (6.27 MB)
README.txt (2.64 KB)
SH1-control.txt (10.67 KB)

Abstract

Inference of deep phylogenies has almost exclusively used protein rather than DNA sequences, based on the perception that protein sequences are less prone to homoplasy and saturation or to issues of compositional heterogeneity than DNA sequences. Here we analyze a model of codon evolution under an idealized genetic code and demonstrate that those perceptions may be misconceptions. We conduct a simulation study to assess the utility of protein versus DNA sequences for inferring deep phylogenies, with protein-coding data generated under models of heterogeneous substitution processes across sites in the sequence and among lineages on the tree, and then analyzed using nucleotide, amino acid, and codon models. Analysis of DNA sequences under nucleotide-substitution models (possibly with the third codon positions excluded) recovered the correct tree at least as often as analysis of the corresponding protein sequences under modern amino acid models. We also applied the different data-analysis strategies to an empirical dataset to infer the metazoan phylogeny. Our results from both simulated and real data suggest that DNA sequences may be as useful as proteins for inferring deep phylogenies and should not be excluded from such analyses. Analysis of DNA data under nucleotide models has a major computational advantage over protein-data analysis, potentially making it feasible to use advanced models that account for among-site and among-lineage heterogeneity in the nucleotide-substitution process in inference of deep phylogenies.

Supplementary data for the manuscript:
Kapli P., Kotari I., Telford M., Goldman N., Yang Z. DNA Sequences Are as Useful as Protein Sequences for Inferring Deep Phylogenies.

Brief explanations of the files:

The script "convert.py" takes a codon alignment as input and outputs the equivalent amino acid alignment and the DNA alignment of the 1st and 2nd codon positions.
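
The conversion described for "convert.py" can be illustrated with a minimal sketch. This is not the authors' script: the function name split_codon_alignment, the use of the standard genetic code, and the handling of gaps ("---" becomes "-") and of ambiguous codons (translated to "X") are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the authors' convert.py.
# Assumes the standard genetic code, with codons ordered TCAG x TCAG x TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codon_alignment(seq):
    """Return (protein, positions_1_and_2) for one aligned codon sequence."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    protein, pos12 = [], []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon == "---":
            protein.append("-")  # keep alignment gaps as a single gap
        else:
            # Assumption: anything not in the table (ambiguity codes,
            # partial gaps) is written as the unknown residue X.
            protein.append(CODON_TABLE.get(codon, "X"))
        pos12.append(codon[:2])  # drop the 3rd codon position
    return "".join(protein), "".join(pos12)


protein, pos12 = split_codon_alignment("ATGGCC---TGG")
```

Applying the function to every sequence of a codon alignment keeps the amino acid and the 1st+2nd position alignments in register with the original codon columns.
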
"HOMO-control.txt" is the control file for simulating sequences under the homogeneous model with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/). All guide trees and model parameters (M0 and M3) are provided in the file.
"SH1-control.txt" is the control file for simulating sequences under the site-heterogeneous (SH1) model. SH1 assumes site-heterogeneous codon frequencies generated from frequencies observed in coding genes from mammalian species. All guide trees and model parameters (M0 and M3) are provided in the file.
The two Python scripts "generate_control_M0_SH2.py" and "generate_control_M3_SH2.py" create control files for INDELible under the SH2 model. They convert the amino acid frequencies from the mixture models C10-C60 into equivalent codon frequencies: the frequency of each codon is the frequency of its amino acid divided by the number of synonymous codons for that amino acid, multiplied by the frequency of the nucleotide at the 3rd codon position. The scripts then format the INDELible control file accordingly. All guide trees are available in each of the scripts.

Syntax for running the script:
./generate_control_M[0 or 3]_SH2.py [AA mixture model: C10-C60] [frequency C] [frequency T] [frequency A] [ALIGNMENT LENGTH]

An example output of generate_control_M0_SH2.py is also provided: example_control_file_SH2.txt.
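
The amino acid to codon frequency conversion used for the SH2 control files can be sketched as follows. This is an illustration, not the code of the generate_control scripts. Several points are assumptions: the function name codon_frequencies, the standard genetic code with stop codons excluded, taking the G frequency as the remainder 1 - (C + T + A) because only three nucleotide frequencies appear on the command line, and rescaling the result to sum to 1. The input values are made up and are not taken from any C10-C60 component.

```python
# Illustrative sketch only; NOT the authors' generate_control_M*_SH2.py.
# Assumes the standard genetic code, with codons ordered TCAG x TCAG x TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_frequencies(aa_freqs, nuc_freqs, normalize=True):
    """Codon frequency = aa frequency / number of synonymous codons
    * frequency of the nucleotide at the 3rd codon position."""
    n_syn = {}
    for aa in CODON_TABLE.values():
        n_syn[aa] = n_syn.get(aa, 0) + 1
    freqs = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # assumption: stop codons get no frequency
        freqs[codon] = aa_freqs[aa] / n_syn[aa] * nuc_freqs[codon[2]]
    if normalize:
        # Assumption: rescale so the 61 sense codons sum to 1.
        total = sum(freqs.values())
        freqs = {codon: f / total for codon, f in freqs.items()}
    return freqs


# Hypothetical inputs: uniform amino acid frequencies and made-up
# nucleotide frequencies given in the order C, T, A.
aa_freqs = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
freq_c, freq_t, freq_a = 0.3, 0.2, 0.3
nuc_freqs = {"C": freq_c, "T": freq_t, "A": freq_a,
             "G": 1 - (freq_c + freq_t + freq_a)}
codon_freqs = codon_frequencies(aa_freqs, nuc_freqs)
```

Under this formula, synonymous codons of the same amino acid differ only through the nucleotide frequency at the 3rd position, so the ratio of two synonymous codon frequencies equals the ratio of their 3rd-position nucleotide frequencies.
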

The scripts "generate_control_M0_BSH.py" and "generate_control_M3_BSH.py" are the equivalents of the SH2 scripts for the branch-site (BSH) model.

Syntax for running the script:
./generate_control_M[0 or 3]_BSH.py [AA mixture model: C10-C60] [frequency C 1] [frequency T 1] [frequency A 1] [frequency C 2] [frequency T 2] [frequency A 2] [ALIGNMENT LENGTH]

The two sets of frequencies are assigned to different parts of the tree.

For the metazoa analyses, the following files are included: 1) "metazoa_alignment.CODONS.clean2.fasta.tar.gz", the DNA alignment for the 22 animal species used in the study, and 2) two scripts, "generate_control_M3_BSH-metazoa.py" and "generate_control_M3_SH2-metazoa.py", that generate INDELible control files under the BSH and SH2 models, respectively, with guide trees matching the metazoa phylogenies A and C of Figure 3 in the manuscript.
