Data from: Exploring tree-like and non-tree-like patterns using genome sequences: an example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh.
Stenz, Noah, University of Wisconsin-Madison
Larget, Bret, University of Wisconsin-Madison
Baum, David A., University of Wisconsin-Madison
Ané, Cécile, University of Wisconsin-Madison
Published Jun 10, 2015 on Dryad.
Cite this dataset
Stenz, Noah; Larget, Bret; Baum, David A.; Ané, Cécile (2015). Data from: Exploring tree-like and non-tree-like patterns using genome sequences: an example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh. [Dataset]. Dryad. https://doi.org/10.5061/dryad.q044d
Genome sequence data contain abundant information about genealogical history, but methods for extracting and interpreting this information are not yet fully developed. We analyzed genome sequences for multiple accessions of the selfing plant, Arabidopsis thaliana, with the goal of better understanding its genealogical history. As expected from accessions of the same species, we found much discordance between nuclear gene trees. Nonetheless, we inferred the optimal population tree under the assumption that all discordance is due to incomplete lineage sorting. To cope with the size of the data (many genes and many taxa), our pipeline is based on parallel computing and divides the problem into four-taxon trees. However, just because a population tree can be estimated does not mean that the assumptions of the multispecies coalescent model hold. Therefore, we implemented a new, nonparametric test to evaluate whether a population tree adequately explains the observed quartet frequencies (the frequencies of gene trees with each resolution of each four-taxon set). This test also considers other models: panmixia and a partially resolved population tree, that is, a tree in which some nodes are collapsed into local panmixia. We found that a partially resolved population tree provides the best fit to the data, providing evidence for tree-like structure within A. thaliana, qualitatively similar to what might be expected between different, closely related species. Further, we show that the pattern of deviation from expectations can be used to identify instances of introgression and detect one clear case of reticulation among ecotypes that have come into contact in the United Kingdom. Our study illustrates how we can use genome sequence data to evaluate whether phylogenetic relationships are strictly tree-like or reticulating.
Stratified Subsample Concordance Factors
Concordance factors and their 95% confidence intervals for all 27,405 possible 4-taxon sets of the stratified subsample. Note: if the major resolution of a 4-taxon set had a concordance factor greater than 0.95, the concordance factors of the two minor resolutions were approximated by splitting the remaining proportion evenly between them, i.e. each was set to (1 - majorCF) / 2. Because these values are approximate, the upper and lower bounds of their 95% confidence intervals are listed as "NA". Newer versions of the pipeline avoid this issue by having BUCKy output all concordance factors, regardless of how small they are.
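The approximation described in the note can be illustrated with a minimal Python sketch. It is not taken from the TICR scripts: the function name `approximate_minor_cfs` and the constant `MAJOR_CF_THRESHOLD` are hypothetical names chosen for illustration, and only the 0.95 threshold and the (1 - majorCF) / 2 rule come from the description above. The final check is purely arithmetic: 27,405 is the number of ways to choose 4 items out of 30, which is an observation about the count and not a statement made in this record.

```python
from math import comb

# Threshold taken from the note above; the constant name is hypothetical.
MAJOR_CF_THRESHOLD = 0.95


def approximate_minor_cfs(major_cf):
    """Return (major, minor1, minor2) using the even-split approximation.

    Applies only when the major concordance factor exceeds the threshold;
    otherwise returns None, because the minor CFs were then reported
    directly rather than approximated.
    """
    if not 0.0 <= major_cf <= 1.0:
        raise ValueError("a concordance factor must lie in [0, 1]")
    if major_cf <= MAJOR_CF_THRESHOLD:
        return None
    # Split the remaining proportion evenly between the two minor resolutions.
    minor_cf = (1.0 - major_cf) / 2.0
    return (major_cf, minor_cf, minor_cf)


# Arithmetic check on the number of 4-taxon sets: C(30, 4) = 27,405.
n_four_taxon_sets = comb(30, 4)
```

For example, `approximate_minor_cfs(0.98)` assigns roughly 0.01 to each minor resolution, so the three concordance factors still sum to 1. Rows produced this way are the ones whose minor-resolution confidence bounds appear as "NA".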
PD Subsample Concordance Factors
Concordance factors and their 95% confidence intervals for all 27,405 possible 4-taxon sets of the PD subsample.
A. thaliana 4th Chromosome MDL Blocks
Gzipped archive containing the alignments of the 3,595 recombinational blocks as determined by MDL.
GitHub repository containing the TICR scripts used in this paper.