Shared single copy genes are generally reliable for inferring phylogenetic relationships among polyploid taxa
Data files
Nov 13, 2023 version files 286.59 MB
- AtAlpha_CDS.tar.gz
- AtAlpha_GeneTrees.tar.gz
- AtAlpha_nexus.tar.gz
- AtAlpha_singlecopy_concat.nex
- AtAlpha_singlecopy_partitions.txt
- README.md
- TGD_CDS.tar.gz
- TGD_GeneTrees.tar.gz
- TGD_nexus.tar.gz
- TGD_singlecopy_concan_partitions.txt
- TGD_singlecopy_concan.nex
- Yeast5spp_nexus.tar.gz
- yeastWGD_concat_singlecopy_partitions.txt
- YeastWGD_GeneTrees.tar.gz
- YeastWGD_nexus.tar.gz
- YeastWGD_singlecopy_concat.nex
Mar 15, 2024 version files 341.83 MB
- AtAlpha_CDS.tar.gz
- AtAlpha_GeneTrees.tar.gz
- AtAlpha_nexus.tar.gz
- AtAlpha_POInT_orthology_data.txt
- AtAlpha_singlecopy_concat.phy
- AtAlpha_singlecopy_partitions.txt
- list_all_singlecopy_topologies.pl
- README.md
- run_all_trees.pl
- TGD_CDS.tar.gz
- TGD_GeneTrees.tar.gz
- TGD_nexus.tar.gz
- TGD_POInT_orthology_data.txt
- TGD_singlecopy_concan_partitions.txt
- TGD_singlecopy_concan.phy
- Yeast5spp_nexus.tar.gz
- yeastWGD_concat_singlecopy_partitions.txt
- yeastWGD_concat_singlecopy.phy
- YeastWGD_GeneTrees.tar.gz
- YeastWGD_nexus.tar.gz
- YeastWGD_POInT_orthology_data.txt
Abstract
Polyploidy, or whole-genome duplication, is expected to confound the inference of species trees with phylogenetic methods for two reasons. First, the presence of retained duplicated genes requires the reconciliation of the inferred gene trees to a proposed species tree. Second, even if the analyses are restricted to shared single copy genes, the occurrence of reciprocal gene loss, where the surviving genes in different species are paralogs from the polyploidy rather than orthologs, will mean that such genes will not have evolved under the corresponding species tree and may not have gene trees that allow inference of the species tree. Here we analyze three different ancient polyploidy events, using synteny-based inferences of orthology and paralogy to infer gene trees from more than 17,000 sets of homologous genes. We find that the simple use of single copy genes from polyploid organisms provides reasonably robust phylogenetic signals, despite the presence of reciprocal gene losses. Such gene trees are also most often in accord with the species relationships inferred from maximum likelihood models of gene loss after polyploidy: a completely distinct phylogenetic signal present in these genomes. As seen in other studies, however, we find that methods for inferring phylogenetic confidence yield high support values even in cases where the underlying data suggest meaningful conflict in the phylogenetic signals.
README: Shared single copy genes are generally reliable for inferring phylogenetic relationships among polyploid taxa
https://doi.org/10.5061/dryad.7d7wm3821
Description of the data and file structure
These data allow for the replication of the phylogenetic analyses presented in the associated manuscript. We provide data for three different polyploidy events: the At-Alpha, TGD and YeastWGD events.
For each event, we provide six (6) types of file:
1) Codon-preserving nucleotide alignments for all sets of WGD-derived homologous genes for each event (aka "Pillars"), in NEXUS format. Phylogenetic trees can be produced from each file using PAUP* with the parameters noted in the manuscript. Pillar numbers are 7243 for AtAlpha (AtAlpha_nexus.tar.gz), 5589 for TGD (TGD_nexus.tar.gz) and 4065 for the yeast WGD (YeastWGD_nexus.tar.gz). Format: GZIPed TAR files.
2) Coding region sequences for the same pillars as #1. If these are translated and aligned with T-Coffee, the codon-preserving alignments of #1 will be produced. Files are AtAlpha_CDS.tar.gz and TGD_CDS.tar.gz (coding sequences for the yeast data are not included but are available from the Yeast Gene Order Browser page: http://ygob.ucd.ie). Format: GZIPed TAR files, with FASTA-formatted sequences for each pillar.
3) POInT-inferred gene trees for all pillars in #1 and #2. POInT uses the assumed species tree and synteny relationships to "prune" lost duplicate genes and produce a gene tree that accords with the synteny relationships and missing genes. These are provided as Newick-formatted trees, one per pillar, corresponding to the pillars in #1 and #2. These assumed gene trees can then be compared to those inferred with PAUP* from the NEXUS files in #1 using the R packages described in the manuscript. Files are AtAlpha_GeneTrees.tar.gz, TGD_GeneTrees.tar.gz, and YeastWGD_GeneTrees.tar.gz. Format: GZIPed TAR files.
4) A concatenated PHYLIP nucleotide alignment of all purely single-copy genes from the pillars in #1, suitable for analysis with a tool like RAxML. Files are AtAlpha_singlecopy_concat.phy, TGD_singlecopy_concan.phy and yeastWGD_concat_singlecopy.phy. Format: Single PHYLIP files.
5) A "partition" file, giving the gene coordinates in #4, such that one can allow different model parameters for each gene when analyzing #4 with RAxML. Files are AtAlpha_singlecopy_partitions.txt, yeastWGD_concat_singlecopy_partitions.txt and TGD_singlecopy_concan_partitions.txt. Format: Single, tab-delimited text files.
6) POInT predicted orthology files for each event. Files are AtAlpha_POInT_orthology_data.txt, TGD_POInT_orthology_data.txt and YeastWGD_POInT_orthology_data.txt. Each pillar in the dataset is represented as a line in this file, with the orthology confidence c given in column 2. For each species, the gene for the first subgenome is given first, then that for the second subgenome. "NONE" indicates a gene loss for that species at that position.
In addition, we provide a single file (Yeast5spp_nexus.tar.gz) comprising the trimmed 5 species yeast alignments used for the restricted analysis of 5 species of yeast in the manuscript.
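The relationship between the coding sequences in #2 and the codon-preserving alignments in #1 can be illustrated with a short sketch. This is not the code used to build the dataset: it is a minimal Python illustration of the general idea of threading an ungapped coding sequence onto an aligned protein row, so that every amino acid maps to its codon and every alignment gap becomes three nucleotide gaps. The function name thread_codons and the toy sequences are hypothetical, and the protein alignment is assumed to have been produced separately (for example with T-Coffee).

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an ungapped CDS onto one gapped protein alignment row, codon by codon."""
    residues = sum(1 for aa in aligned_protein if aa != gap)
    # Assumption: the CDS may carry a trailing stop codon absent from the protein.
    if len(cds) == 3 * (residues + 1):
        cds = cds[:-3]
    if len(cds) != 3 * residues:
        raise ValueError("CDS length does not match the aligned protein")
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)          # one protein gap -> three nucleotide gaps
        else:
            out.append(cds[pos:pos + 3])  # one residue -> its original codon
            pos += 3
    return "".join(out)


# Toy example with made-up sequences (not taken from the dataset).
protein_alignment = {"geneA": "MK-L", "geneB": "MKTL"}
coding_seqs = {"geneA": "ATGAAACTG", "geneB": "ATGAAGACCCTGTAA"}

codon_alignment = {
    name: thread_codons(row, coding_seqs[name])
    for name, row in protein_alignment.items()
}
for name, row in codon_alignment.items():
    print(name, row)
```

Because gaps are always inserted in multiples of three, the reading frame is kept intact across the nucleotide alignment.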
Code/Software
The POInT software is freely available at https://github.com/gconant0/POInT
Included here are two scripts for generating the analysis data: run_all_trees.pl and list_all_singlecopy_topologies.pl
INSTRUCTIONS FOR RUNNING SCRIPTS:
run_all_trees.pl
DEPENDENCIES:
1) T-Coffee
2) PAUP*
3) Gavin Conant's seq_tools package (https://github.com/gconant0/seq_tools)
RUNNING
- Run it from a directory containing the CDS files to be analyzed, named Pillar#_CDS.fas. These files are included in this dataset.
- Command
./run_all_trees.pl
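Before launching run_all_trees.pl, it may help to confirm that the working directory looks as described above. The following Python sketch is not part of the dataset. It lists files matching the Pillar#_CDS.fas naming convention and checks whether the external programs can be found on the PATH. The executable names t_coffee and paup are assumptions and may differ on a given installation.

```python
import re
import shutil
from pathlib import Path

# File naming convention stated in the README: Pillar#_CDS.fas
PILLAR_PATTERN = re.compile(r"^Pillar(\d+)_CDS\.fas$")


def find_pillar_files(directory="."):
    """Return (pillar number, path) pairs for Pillar#_CDS.fas files, sorted by number."""
    hits = []
    for path in Path(directory).iterdir():
        match = PILLAR_PATTERN.match(path.name)
        if match and path.is_file():
            hits.append((int(match.group(1)), path))
    return sorted(hits)


def missing_tools(names=("t_coffee", "paup")):
    """Return the executables that cannot be found on PATH (names are assumptions)."""
    return [name for name in names if shutil.which(name) is None]


pillars = find_pillar_files(".")
print(f"Found {len(pillars)} pillar CDS files in the current directory")
for tool in missing_tools():
    print(f"Warning: '{tool}' was not found on PATH")
```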
list_all_singlecopy_topologies.pl
DEPENDENCIES:
1) R and the R TreeDist package (requires the ape package)
2) The Perl R interpreter
RUNNING
- It should be run after run_all_trees.pl from the same directory
- It requires 3 arguments:
1) A POInT orthology file (the POInT predicted orthology file described above, e.g. AtAlpha_POInT_orthology_data.txt)
2) A POInT confidence cutoff (Between 0.5 and 1.0: 0.9 is a good choice)
3) An output file
- Command
./list_all_singlecopy_topologies.pl AtAlpha_POInT_orthology_data.txt 0.9 AtAlpha_SingleCopy09_TreeHist.txt
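To make the role of the orthology file and the confidence cutoff concrete, here is a minimal Python sketch of how pillars might be filtered. It does not reproduce the Perl script's logic. It assumes whitespace-delimited lines with no header, a pillar identifier in column 1, the confidence in column 2 (as stated in the file description), and the remaining columns as alternating first-subgenome and second-subgenome genes, one pair per species. It also assumes that "single copy" means exactly one of the two genes is present in every species. All function names and the example lines are hypothetical.

```python
def parse_orthology_line(line):
    """Split one pillar line into (pillar id, confidence, per-species gene pairs)."""
    fields = line.split()
    pillar_id = fields[0]            # assumption: column 1 identifies the pillar
    confidence = float(fields[1])    # column 2: orthology confidence
    genes = fields[2:]
    if len(genes) % 2 != 0:
        raise ValueError(f"pillar {pillar_id}: expected two gene columns per species")
    pairs = list(zip(genes[0::2], genes[1::2]))
    return pillar_id, confidence, pairs


def is_single_copy(pairs):
    """True when every species keeps exactly one of its two duplicate genes."""
    return all((first == "NONE") != (second == "NONE") for first, second in pairs)


def select_single_copy(lines, cutoff=0.9):
    """Return ids of single-copy pillars whose confidence meets the cutoff."""
    if not 0.5 <= cutoff <= 1.0:
        raise ValueError("cutoff should lie between 0.5 and 1.0")
    selected = []
    for line in lines:
        if not line.strip():
            continue
        pillar_id, confidence, pairs = parse_orthology_line(line)
        if confidence >= cutoff and is_single_copy(pairs):
            selected.append(pillar_id)
    return selected


# Made-up example lines for two species (not taken from the dataset).
example = [
    "1\t0.97\tgeneA1\tNONE\tNONE\tgeneB2",    # single copy, high confidence
    "2\t0.99\tgeneC1\tgeneC2\tgeneD1\tNONE",  # duplicate retained in one species
    "3\t0.62\tNONE\tgeneE2\tgeneF1\tNONE",    # single copy, low confidence
]
print(select_single_copy(example, cutoff=0.9))
```

In this toy input, pillar 1 illustrates the reciprocal-loss pattern discussed in the abstract: the surviving genes come from different subgenomes in the two species.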
Sharing/Access information
All of these data are offered under the standard CC0 license and may be copied and redistributed as needed.
Methods
Data on syntenic orthologs were obtained from the POInT browser web server (wgd.statgen.ncsu.edu) and derived from the papers cited there and in the manuscript.