Ribosomal RNA encapsulates a wealth of evolutionary information, including genetic variation that can be used to discriminate between organisms at a wide range of taxonomic levels. For example, the prokaryotic 16S rDNA sequence is very widely used both in phylogenetic studies and as a marker in metagenomic surveys, and the ITS region, frequently used in plant phylogenetics, is now recognised as a fungal DNA barcode. However, this widespread use has drawn criticism, principally because of the difficulty of distinguishing paralogous from orthologous rDNA units and because of intragenomic variation, both of which may be significant barriers to accurate phylogenetic inference. We recently analysed datasets from the Saccharomyces Genome Resequencing Project (SGRP), characterising rDNA sequence variation within multiple strains of the baker's yeast Saccharomyces cerevisiae and its nearest wild relative Saccharomyces paradoxus in unprecedented detail. Notably, both species possess single-locus rDNA systems. Here, we use these new variation datasets to assess whether a more detailed characterisation of the rDNA locus can alleviate the second of these phylogenetic issues, sequence heterogeneity, while controlling for the first. We demonstrate that a strong phylogenetic signal exists within both datasets and illustrate how they can be used, with existing methodology, to estimate intra-species phylogenies of yeast strains consistent with those derived from whole-genome approaches. We also describe the use of partial Single Nucleotide Polymorphisms (pSNPs), a type of sequence variation found only in repetitive genomic regions, in identifying key evolutionary features such as genome hybridisation events, and show their consistency with whole-genome Structure analyses.
We conclude that our approach can transform rDNA sequence heterogeneity from a problem into a useful source of evolutionary information, enabling the estimation of highly accurate phylogenies of closely related organisms, and discuss how it could be extended to future studies of multi-locus rDNA systems.
Appendix 1 - S. paradoxus Variation Table
pSNP and SNP frequencies for SGRP sequence reads of 26 S. paradoxus strains and the S288c S. cerevisiae strain compared with the rDNA consensus sequence of the CBS432 S. paradoxus type strain.
West_rDNA_Appendix_1.xlsx
Appendix 2 - S. cerevisiae Variation Table
pSNP and SNP frequencies for SGRP sequence reads of 34 S. cerevisiae strains and the Q32.3 S. paradoxus strain compared with the rDNA consensus sequence of the S288c S. cerevisiae type strain.
West_rDNA_Appendix_2.xlsx
Appendix 3 - Phylogenetic Networks of S. paradoxus and S. cerevisiae Strains
Both a) and b) show an enlargement of the main population structure in the network, with the small boxed inset showing the whole network including the outgroup. a) The S. paradoxus network shows a clear separation of each geographic population. b) The S. cerevisiae network shows a more complex network structure, consistent with our knowledge of this population.
West_rDNA_Appendix_3.pdf
Appendix 4 - Bar Charts of the pSNP Percentage Occupancy in S. cerevisiae by Population Type
a) Bar chart of the S. cerevisiae structured strains, plotting the number of pSNPs against pSNP occupancy. The boxed section highlights pSNPs with occupancies greater than 10% and less than 90%. The Malaysian, North American and West African strains have very few pSNPs within this boxed area, and we denote these as structured clean strains. Strains with a larger number of pSNPs within the boxed area show a degree of mosaicism, and we classify them as structured mosaic strains.
b) Bar chart of S. cerevisiae mosaic strains, where there are a large number of pSNPs within the 10% to 90% occupancy range.
West_rDNA_Appendix_4.pdf
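The occupancy window described for Appendix 4 can be illustrated with a short Python sketch. It only counts, per strain, the pSNPs whose occupancy lies strictly between 10% and 90%. The function name, the strain labels and the occupancy values are hypothetical and are not taken from the appendix tables. No cut-off for "very few" pSNPs is stated here, so the sketch does not attempt to classify strains.

```python
def count_intermediate_psnps(occupancies, low=10.0, high=90.0):
    """Count pSNPs whose percentage occupancy lies strictly between low and high.

    `occupancies` is a list of percentages (0-100), one per pSNP.
    The 10% and 90% defaults mirror the boxed area described for Appendix 4.
    """
    return sum(1 for occ in occupancies if low < occ < high)


# Hypothetical occupancy values, for illustration only.
example_strains = {
    "strain_A": [2.0, 5.5, 97.0, 99.1],
    "strain_B": [12.0, 45.0, 60.5, 88.0, 3.0],
}

counts = {
    strain: count_intermediate_psnps(values)
    for strain, values in example_strains.items()
}

for strain, n in counts.items():
    print(f"{strain}: {n} pSNPs in the 10-90% occupancy window")
```

A strain with a low count would correspond to the "clean" pattern and a strain with a high count to the "mosaic" pattern, but where to draw that line is a judgement this sketch leaves open.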
S. paradoxus CE distance matrix
Cavalli-Sforza and Edwards rDNA-based distance matrix for 26 S. paradoxus strains plus S. cerevisiae strain S288c
S_paradoxus_CE_Dist.nex
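As a rough guide to how a Cavalli-Sforza and Edwards chord distance can be computed from frequency data, the Python sketch below uses one common multi-locus formulation, D^2 = 4 * sum_m(1 - sum_i sqrt(p_mi * q_mi)) / sum_m(a_m - 1), where a_m is the number of alleles at locus m. It assumes that each variable rDNA site is treated as a biallelic locus and that the pSNP or SNP frequency (as a fraction between 0 and 1) is the frequency of the variant allele. Both assumptions, and the function name, are illustrative; this record does not state exactly how the deposited matrices were calculated.

```python
import math


def chord_distance(freqs_a, freqs_b):
    """Cavalli-Sforza and Edwards chord distance between two strains.

    Each argument is a list of variant-allele frequencies (0.0-1.0), one per
    site, in the same site order. Sites are assumed to be biallelic, so each
    site contributes (a_m - 1) = 1 to the denominator.
    """
    if len(freqs_a) != len(freqs_b) or not freqs_a:
        raise ValueError("frequency lists must be non-empty and of equal length")

    total = 0.0
    for p, q in zip(freqs_a, freqs_b):
        # Sum over the two alleles (variant and reference) at this site.
        total += 1.0 - math.sqrt(p * q) - math.sqrt((1.0 - p) * (1.0 - q))

    total = max(total, 0.0)  # guard against tiny negative rounding errors
    return math.sqrt(4.0 * total / len(freqs_a))


# Hypothetical frequencies for two strains at three sites.
strain_x = [0.00, 0.35, 1.00]
strain_y = [0.10, 0.80, 1.00]
print(chord_distance(strain_x, strain_y))
```

Under this formulation two strains with identical frequencies have distance 0, and two strains fixed for different alleles at every site have distance 2.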
S. cerevisiae CE distance matrix
Cavalli-Sforza and Edwards rDNA-based distance matrix for 34 S. cerevisiae strains plus S. paradoxus strain Q32.3
S_cerevisiae_CE_Dist.nex
S. paradoxus NJ tree
Neighbor-Joining phylogenetic tree derived from the S. paradoxus CE distance matrix
S_paradoxus_tree.nex
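For readers unfamiliar with how a Neighbor-Joining tree is derived from a distance matrix, the following Python sketch implements the standard Saitou and Nei procedure on a small symmetric matrix. It is a teaching example only: the software used to build the deposited trees is not named in this record, and the function name, node labels and example distances are hypothetical.

```python
def neighbor_joining(names, dist):
    """Standard Neighbor-Joining on a symmetric distance matrix.

    `names` is a list of taxon labels and `dist` a list of lists with
    dist[i][j] the distance between names[i] and names[j].
    Returns (joins, final_edge): `joins` is a list of tuples
    (child_a, child_b, length_a, length_b, new_node), and `final_edge`
    is (node_a, node_b, length) for the last two remaining nodes.
    """
    nodes = list(names)
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    joins = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all other current nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}

        # Choose the pair that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a

        # Distances from the new node to every remaining node.
        u = f"node{next_id}"
        next_id += 1
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])

        nodes = [c for c in nodes if c not in (a, b)] + [u]
        joins.append((a, b, len_a, len_b, u))

    final_edge = (nodes[0], nodes[1], d[nodes[0]][nodes[1]])
    return joins, final_edge


# Small hypothetical five-taxon matrix, for illustration only.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joins, final_edge = neighbor_joining(taxa, matrix)
for join in joins:
    print(join)
print(final_edge)
```

In practice the input would be a distance matrix such as the CE matrices listed here, read from the NEXUS file by whichever parser is preferred.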
S. cerevisiae NJ tree
Neighbor-Joining phylogenetic tree derived from the S. cerevisiae CE distance matrix
S_cerevisiae_tree.nex
Perl script for coverage/copy number estimation
Perl script to calculate the rDNA unit coverage from a sequence read dataset and to estimate the number of rDNA units (copy number) in an rDNA tandem array.
coverage_v2.pl
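The general idea behind read-depth copy number estimation can be sketched in Python as follows. This is not a translation of coverage_v2.pl, whose internals are not described here. It assumes the simplest approach, in which the mean read depth over a single rDNA unit is divided by the mean depth over the rest of the genome. All function names, parameter names and numbers are hypothetical.

```python
def mean_coverage(total_aligned_bases, region_length):
    """Mean read depth: aligned bases divided by the length of the region."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return total_aligned_bases / region_length


def estimate_copy_number(rdna_bases, rdna_unit_length, genome_bases, genome_length):
    """Estimate rDNA copy number as rDNA unit depth over genome-wide depth.

    Assumes reads sample the genome uniformly, so a unit repeated N times
    in a tandem array attracts roughly N times the single-copy depth.
    """
    rdna_cov = mean_coverage(rdna_bases, rdna_unit_length)
    genome_cov = mean_coverage(genome_bases, genome_length)
    if genome_cov == 0:
        raise ValueError("genome coverage is zero; cannot estimate copy number")
    return rdna_cov / genome_cov


# Hypothetical numbers, for illustration only.
copies = estimate_copy_number(
    rdna_bases=2_730_000,      # bases aligned to one rDNA unit
    rdna_unit_length=9_100,    # assumed unit length in bp
    genome_bases=24_000_000,   # bases aligned to the non-rDNA genome
    genome_length=12_000_000,  # assumed non-rDNA genome length in bp
)
print(f"Estimated rDNA copy number: {copies:.0f}")
```

With low-coverage data the estimate is sensitive to uneven sampling and mapping bias, so it is best read as an approximation.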
Updated Appendix 3 - Phylogenetic Networks of S. paradoxus and S. cerevisiae Strains
Both a) and b) show an enlargement of the main population structure in the network, with the small grey inset showing the whole network including the outgroup. a) The S. paradoxus network shows a clear separation of each geographic population. b) The S. cerevisiae network shows a more complex network structure, consistent with our knowledge of this population.
West_rDNA_Appendix_3.svg
Updated S. paradoxus CE distance matrix
Cavalli-Sforza and Edwards rDNA-based distance matrix for 26 S. paradoxus strains plus S. cerevisiae strain S288c
S_paradoxus_CE_dist.nex
Updated S. cerevisiae CE distance matrix
Cavalli-Sforza and Edwards rDNA-based distance matrix for 34 S. cerevisiae strains plus S. paradoxus strain Q32.3
S_cerevisiae_CE_dist.nex
Updated S. paradoxus NJ tree
Neighbor-Joining phylogenetic tree derived from the S. paradoxus CE distance matrix
S_paradoxus_tree.nex
Updated S. cerevisiae NJ tree
Neighbor-Joining phylogenetic tree derived from the S. cerevisiae CE distance matrix
S_cerevisiae_tree.nex