Supplementary data for the manuscript:
Kapli P., Kotari I., Telford M., Goldman N., Yang Z. DNA Sequences Are as Useful as Protein Sequences for Inferring Deep Phylogenies.
Brief explanations of the files:
- The script "convert.py" takes a codon alignment as input and outputs the equivalent amino acid alignment and the DNA alignment of the 1st and 2nd codon positions.
- The "HOMO-control.txt" is the control file for simulating sequences under the homogeneous model with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/). All guide trees and model parameters (M0 and M3) are provided in the file.
- The "SH1-control.txt" is the control file for simulating sequences under the site-heterogeneous (SH1) model. SH1 assumes site-heterogeneous codon frequencies generated from observed frequencies in coding genes from mammal species. All guide trees and model parameters (M0 and M3) are provided in the file.
- The two Python scripts "generate_control_M0_SH2.py" and "generate_control_M3_SH2.py" create control files for INDELible under the SH2 model. They convert the amino acid frequencies from the mixture models C10-C60 into equivalent codon frequencies: the frequency of each codon is the frequency of its amino acid divided by the number of synonymous codons for that amino acid, multiplied by the frequency of the nucleotide at the 3rd codon position. They then format the INDELible control file accordingly. All guide trees are available in each of the scripts.
Syntax for running the script:
./generate_control_M[0 or 3]_SH2.py [AA mixture model: C10-C60] [frequency C] [frequency T] [frequency A] [ALIGNMENT LENGTH]
An example output of "generate_control_M0_SH2.py" is also provided: "example_control_file_SH2.txt".
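The codon frequency calculation described for the SH2 scripts can be sketched as follows. This is a minimal illustration and not the code of the provided scripts: the function name "codon_frequencies" is hypothetical, the standard genetic code is assumed, the frequency of G is assumed to be 1 minus the three frequencies given on the command line, and the final rescaling so that the 61 sense codon frequencies sum to 1 is also an assumption.

```python
from itertools import product

# Standard genetic code in TCAG order (assumed; stop codons are marked "*").
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_frequencies(aa_freqs, nuc_freqs):
    """Turn amino acid frequencies into frequencies for the 61 sense codons.

    aa_freqs:  dict mapping one-letter amino acid codes to frequencies
    nuc_freqs: dict mapping "T", "C", "A", "G" to nucleotide frequencies
    """
    sense = {codon: aa for codon, aa in CODON_TABLE.items() if aa != "*"}

    # Number of synonymous codons for each amino acid.
    n_syn = {}
    for aa in sense.values():
        n_syn[aa] = n_syn.get(aa, 0) + 1

    # Amino acid frequency / number of synonymous codons
    # * frequency of the nucleotide at the 3rd codon position.
    raw = {
        codon: aa_freqs.get(aa, 0.0) / n_syn[aa] * nuc_freqs[codon[2]]
        for codon, aa in sense.items()
    }

    # Assumption: rescale so that the codon frequencies sum to 1.
    total = sum(raw.values())
    return {codon: value / total for codon, value in raw.items()}


if __name__ == "__main__":
    # Hypothetical input: one flat amino acid profile and C, T, A frequencies.
    freq_c, freq_t, freq_a = 0.3, 0.2, 0.3
    nuc = {"C": freq_c, "T": freq_t, "A": freq_a,
           "G": 1.0 - (freq_c + freq_t + freq_a)}
    profile = {aa: 1.0 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    for codon, freq in sorted(codon_frequencies(profile, nuc).items()):
        print(codon, round(freq, 5))
```

In a mixture model such as C10, the same function would be applied once per amino acid profile of the mixture.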
- The scripts "generate_control_M0_BSH.py" and "generate_control_M3_BSH.py" are the equivalents of the SH2 scripts for the branch-site model.
Syntax for running the script:
./generate_control_M[0 or 3]_BSH.py [AA mixture model: C10-C60] [frequency C 1] [frequency T 1] [frequency A 1] [frequency C 2] [frequency T 2] [frequency A 2] [ALIGNMENT LENGTH]
The two sets of frequencies are assigned to different parts of the tree.
- For the metazoa analyses, the following files are included: 1) "metazoa_alignment.CODONS.clean2.fasta.tar.gz", the DNA alignment for the 22 animal species used in the study, and 2) two scripts, "generate_control_M3_BSH-metazoa.py" and "generate_control_M3_SH2-metazoa.py", that generate INDELible control files under the BSH and SH2 models, respectively, with guide trees matching the metazoa phylogenies A and C of Figure 3 in the manuscript.
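For orientation, the conversion performed by "convert.py" (codon alignment to amino acid alignment and to 1st and 2nd codon positions) can be sketched as below. This is an illustration only and not the provided script: the function names are hypothetical, the standard genetic code is assumed, and the handling of gaps ("---" becomes "-") and of unrecognised or ambiguous codons (translated to "X") is an assumption.

```python
from itertools import product

# Standard genetic code in TCAG order (assumed; stop codons are marked "*").
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate one aligned codon sequence into an amino acid sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    # Assumption: a full gap codon becomes "-", anything unrecognised "X".
    return "".join("-" if c == "---" else CODON_TABLE.get(c, "X") for c in codons)


def first_second_positions(seq):
    """Keep the 1st and 2nd position of every codon and drop the 3rd."""
    usable = len(seq) - len(seq) % 3
    return "".join(seq[i:i + 2] for i in range(0, usable, 3))


if __name__ == "__main__":
    # Hypothetical two-sequence codon alignment.
    alignment = {"taxon1": "ATGGCC---TGG", "taxon2": "ATGGCTAAATGG"}
    for name, seq in alignment.items():
        print(name, translate(seq), first_second_positions(seq))
```

Because both functions work codon by codon, the columns of the two output alignments stay in register with the input codon alignment.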