Inferring the evolutionary model of community-structuring traits with convolutional kitchen sinks: Code and data
Data files
Jul 10, 2023 version files (18.33 MB):
- angiocomm.rds
- angiompd.rds
- README.md
May 06, 2024 version files (18.33 MB):
- angiocomm.rds
- angiompd.rds
- README.md
Abstract
When communities are assembled through processes such as filtering or limiting similarity acting on phylogenetically conserved traits, the evolutionary signature of those traits may be reflected in patterns of community membership. We show how the model of trait evolution underlying community-structuring traits can be inferred from community membership data using both a variation of a traditional eco-phylogenetic metric--the mean pairwise distance (MPD) between taxa--and a recent machine learning tool, Convolutional Kitchen Sinks (CKS). Both methods perform well across a range of phylogenetically informative evolutionary models, but CKS outperforms MPD as tree size increases. We demonstrate CKS by inferring the evolutionary history of freeze tolerance in angiosperms. Our analysis is consistent with a late burst model of freeze tolerance, suggesting it evolved recently. We suggest that data ordered on phylogenies such as trait values, species interactions, or community presence/absence are good candidates for CKS modeling because the generative models produce structured differences between neighboring points that CKS is well-suited for. We introduce the R package kitchen to perform CKS for generic application of the technique.
README: Inferring the Evolutionary Model of Community-Structuring Traits with Convolutional Kitchen Sinks: code and data
This README file was generated on 2024-05-03 by Avery Michael Kruger.
GENERAL INFORMATION
Title of Dataset: Inferring the Evolutionary Model of Community-Structuring Traits with Convolutional Kitchen Sinks: code and data
Corresponding Investigator
Name: Avery Michael Kruger
Institution: University of British Columbia
Email: avery.kruger@botany.ubc.ca
Co-investigator
Name: Vaishaal Shankar
Institution: Apple Inc.
Former Institution: Amazon.com, Inc.
Co-investigator
Name: Jonathan Davies
Institution: University of British Columbia
SHARING/ACCESS INFORMATION
Data were derived from the following sources:
- Zanne, Amy E. et al. (2014), Data from: Three keys to the radiation of angiosperms into freezing environments, Dryad, Dataset, https://doi.org/10.5061/dryad.63q27
Recommended citation for this dataset:
Kruger, Avery; Shankar, Vaishaal; Davies, Jonathan (2024), Inferring the evolutionary model of community-structuring traits with convolutional kitchen sinks: Code and data, Dryad, Dataset, https://doi.org/10.5061/dryad.zw3r2289q
DATA & FILE OVERVIEW
Description:
These files include the Supplementary Figures for Kruger et al., as well as the code and archived simulation data necessary to replicate the figures and results of Kruger et al. The analyses investigate how well two methods recover the evolutionary model of the traits on which communities are assembled: a machine learning technique termed Convolutional Kitchen Sinks (CKS), and models trained on a series of Mean Pairwise Distance (MPD) metrics, termed an MPD curve. Communities were simulated on both simulated and empirical phylogenies by evolving traits on the phylogenies according to an Early Burst transformation governed by a normally distributed evolutionary parameter. Data were separated into training and test data. The evolutionary parameters used in simulation were then modeled as a function of the simulated communities in the training data, using both the CKS method and the MPD method. The models were then tested by examining the relationship between the predicted and known parameters of the test data. Finally, the trained models were used to make predictions for the known community of freeze-tolerant dicots.
File List:
File 1 Name: SupplementalFigures.pdf
File 1 Description: This file contains Supplementary Figures S1-S7.
File 2 Name: source_functions.R
File 2 Description: This code contains functions called in various scripts. It is called by scripts as needed, so it is not necessary to run this code independently.
File 3 Name: 1_setup.R
File 3 Description: This code creates the folders `data`, `data/angiosim`, `plots`, and `output` in the working directory if they do not already exist, and moves angiocomm.rds and angiompd.rds into the `data/angiosim` folder. This file also contains code to install PhyloMeasures 2.1, an archived CRAN package that provides a necessary function, and code to install kitchen (avery-kruger/kitchen) from GitHub.
File 4 Name: 2_community_sims.R
File 4 Description: This code simulates phylogenies and communities for later analysis. Writes to `data` folder and creates it if it does not already exist.
File 5 Name: 3_community_cks.R
File 5 Description: This code tests how well CKS and MPD curve methods perform at predicting parameters used to evolve traits that communities are assembled upon. This file produces Figure 3 of the manuscript.
File 5 Dependency: 2_community_sims.R must be run first to generate data.
File 6 Name: 4_dicottree_trim.R
File 6 Description: This code takes a phylogeny from Zanne et al. 2014 and trims it to contain only species in Magnoliopsida for which freezing data in MinimumFreezingExposure.csv is present.
File 6 Dependency 1: Requires Vascular_Plants_rooted.dated.tre from Zanne et al. 2014
File 6 Dependency 2: Requires MinimumFreezingExposure.csv from Zanne et al. 2014
File 7 Name: 5_dicot_sims.R
File 7 Description: This code simulates communities and MPD curves on a dicot phylogeny.
File 7 Dependency: Requires Zanne.angiosperm.tre, which is generated by 4_dicottree_trim.R.
File 8 Name: 6_dicot_cks.R
File 8 Description: This code trains CKS and MPD models on communities simulated on the Zanne phylogeny and checks predictions against known values. It then predicts an evolutionary parameter given the known community of freeze-tolerant plants. This file creates Figures 7, 8, and 9.
File 8 Dependency: Requires angiocomm.rds and angiompd.rds files in the data/angiosim folder. These already exist, but may be generated by running 5_dicot_sims.R.
File 9 Name: fig4_s2_s3_computetime.R
File 9 Description: This code runs kitchen_sweep across communities and superparameters of different sizes to demonstrate the relationship between those dimensions and computational time. Before running, either run RandTrees_simulations.R to generate the data this script uses, or alter the code to use other data. Plots are presented in Figure 4 and Supplementary Figures S2 and S3.
File 9 Dependency: Requires files beginning with 1024treesim_comm that are generated by running 2_community_sims.R.
File 10 Name: fig5_altassembly_varytraitcov.R
File 10 Description: This code examines how covariance of traits affects inference. Two plots are presented in Figure 5.
File 11 Name: fig6_s5_altassembly_diffalphas.R
File 11 Description: This code examines how different numbers of independent alphas affect performance of CKS. Plots are presented in Figure 6 and Supplementary Figure S5.
File 12 Name: s1_altassembly_ntraits.R
File 12 Description: This code explores how the number of traits affects inference. Plot is presented in Supplementary Figure S1.
File 13 Name: s4_altprediction_delta.R
File 13 Description: This code compares the performance of MPD curves generated on Early Burst- (EB) and delta-transformed phylogenies. Plot is presented in Supplementary Figure S4.
File 14 Name: s6_s7_altassembly_limsim.R
File 14 Description: This code examines the performance of CKS and MPD methods on communities assembled under limiting similarity. Plots are used in Supplementary Figures S6-S7.
File 15 Name: x_altassembly_varycommsize.R
File 15 Description: This code explores how community size affects performance of CKS. Data were not presented in the manuscript or Supplementary Figures.
File 16 Name: angiocomm.rds
File 16 Description: This RDS file contains the original simulations on the empirical phylogeny that were used in the manuscript. These data are used for training and testing a CKS model that predicts the evolutionary parameter describing evolution of traits the communities were assembled on. These data were originally generated with 5_dicot_sims.R.
File 17 Name: angiompd.rds
File 17 Description: This file contains a data frame holding the MPD curve for each corresponding row in angiocomm.rds. The MPD curve is a series of Mean Pairwise Distance metrics calculated across a series of transformations of the dicot phylogeny. These data are used to train and test a linear model that predicts the evolutionary parameter describing evolution of the traits the communities were assembled on. These data were originally generated with 5_dicot_sims.R.
METHODOLOGICAL INFORMATION
Methods for processing the data:
1_setup.R should be run first to ensure folders are set up properly in the working directory and to install PhyloMeasures and kitchen. After that, 2_ and 3_ may be run together or 4_-6_ may be run together, as described below.
2_community_sims.R and 3_community_cks.R should be run sequentially.
4_ through 6_ require Vascular_Plants_rooted.dated.tre and MinimumFreezingExposure.csv, files that are available from Zanne et al. (2014), as described in Sharing/Access information.
4_dicottree_trim.R should be run first to create the appropriate phylogeny.
5_dicot_sims.R may optionally be run to simulate data. The files angiocomm.rds and angiompd.rds were generated with this code.
6_dicot_cks.R can then be run to perform analyses.
Files 9-14 were used to produce figures for Sensitivity of CKS to Alternative Models and Supplemental Figures. These scripts are all independent of each other and do not need to be run in any particular order.
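The setup step and run order described above can be sketched as follows. This is a minimal illustration, not the contents of 1_setup.R: `setup_dirs` and `run_order` are hypothetical names, and the commented install calls assume the `remotes` package is available.

```r
# Hypothetical helper mirroring what 1_setup.R is described as doing:
# create the expected folders and move the two archived .rds files.
setup_dirs <- function(root = ".") {
  dirs <- file.path(root, c("data", file.path("data", "angiosim"),
                            "plots", "output"))
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  for (f in c("angiocomm.rds", "angiompd.rds")) {
    from <- file.path(root, f)
    if (file.exists(from)) {
      file.rename(from, file.path(root, "data", "angiosim", f))
    }
  }
  invisible(dirs)
}

# One possible way to install the non-CRAN packages (assumes 'remotes'):
# remotes::install_version("PhyloMeasures", version = "2.1")
# remotes::install_github("avery-kruger/kitchen")

# Run order stated in this README; the two groups are independent.
run_order <- list(
  simulated_trees = c("2_community_sims.R", "3_community_cks.R"),
  dicot_tree = c("4_dicottree_trim.R", "5_dicot_sims.R", "6_dicot_cks.R")
)
# for (script in run_order$simulated_trees) source(script)
```

Calling `setup_dirs()` from the folder that holds the downloaded files is enough to prepare the layout. The `source()` loop is left commented out because the simulations can take a long time.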
angiocomm.rds was created with 5_dicot_sims.R. For each simulation, traits were evolved on the dicot phylogeny by rescaling the phylogeny with an Early Burst transformation using a normally distributed parameter with mean 0 and standard deviation 0.08. A random freezing-exposed species was chosen as an optimum, and then the 4,353 species closest in Euclidean trait space were selected as present in a community and assigned the value 1.
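The assembly step described above (choose an optimum, then mark the nearest species in trait space as present) can be sketched in base R. This is an illustration under stated assumptions, not the code in 5_dicot_sims.R: the trait values are random stand-ins rather than traits evolved on an Early Burst-rescaled phylogeny, and `assemble_community`, the number of traits, and the toy sizes are hypothetical.

```r
set.seed(42)
n_species <- 20   # toy size; the archived data use the dicot phylogeny
n_traits  <- 2    # assumption: the number of traits is not stated here
comm_size <- 5    # toy size; the README reports 4,353 species

# Stand-in trait matrix (NOT evolved on a phylogeny in this sketch).
traits <- matrix(rnorm(n_species * n_traits), n_species, n_traits,
                 dimnames = list(paste0("sp", seq_len(n_species)), NULL))
# Hypothetical flag marking which species are freezing-exposed.
freezing_exposed <- rep(c(TRUE, FALSE), length.out = n_species)

assemble_community <- function(traits, eligible, comm_size) {
  candidates <- which(eligible)
  # Pick one eligible species at random to act as the trait optimum.
  optimum <- candidates[sample.int(length(candidates), 1)]
  # Euclidean distance of every species to the optimum in trait space.
  d <- sqrt(rowSums(sweep(traits, 2, traits[optimum, ])^2))
  comm <- integer(nrow(traits))
  comm[order(d)[seq_len(comm_size)]] <- 1L   # closest species are present
  names(comm) <- rownames(traits)
  list(optimum = optimum, comm = comm)
}

sim <- assemble_community(traits, freezing_exposed, comm_size)
```

In this sketch the optimum species has distance zero to itself and is therefore counted among the selected species. Whether the original script counts it the same way is an assumption.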
angiompd.rds was created with 5_dicot_sims.R. For each row in angiocomm.rds, the mean pairwise distance of the community was calculated on a series of delta transformations of the dicot phylogeny, with transformation parameters ranging from 0.05 to 40.
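As a rough illustration of the MPD curve, the sketch below computes the mean pairwise distance of one community from a pairwise distance matrix, then repeats it over several matrices. It is not the PhyloMeasures-based code used in the scripts: `mpd`, `dist_mat`, and `dist_list` are hypothetical names, and the second matrix is an arbitrary distortion standing in for a transformed phylogeny, not an actual delta transformation.

```r
# Mean pairwise distance among the species present in a community.
mpd <- function(comm, dist_mat) {
  present <- which(comm == 1)
  sub <- dist_mat[present, present, drop = FALSE]
  mean(sub[upper.tri(sub)])   # each unordered pair counted once
}

# Toy pairwise distances among four species.
sp <- c("A", "B", "C", "D")
dist_mat <- matrix(c(0, 2, 6, 6,
                     2, 0, 6, 6,
                     6, 6, 0, 4,
                     6, 6, 4, 0),
                   nrow = 4, byrow = TRUE, dimnames = list(sp, sp))
comm <- c(A = 1, B = 1, C = 0, D = 1)

single_mpd <- mpd(comm, dist_mat)   # pairs A-B, A-D, B-D: (2 + 6 + 6) / 3

# An MPD curve is the same metric evaluated on a series of distance
# matrices, here a hypothetical list with one matrix per transformation.
dist_list <- list(t1 = dist_mat, t2 = sqrt(dist_mat))
mpd_curve <- sapply(dist_list, function(m) mpd(comm, m))
```

With real data, each matrix in `dist_list` would come from one delta-transformed copy of the dicot phylogeny, giving one column of angiompd.rds per transformation parameter.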
Instrument- or software-specific information needed to interpret the data:
CRAN Packages:
ape 5.6-2
geiger 2.0.10
ggplot2 3.4.0
parallel 4.2.2
phytools 1.2-0
Packages not available on CRAN (see File 3 Description):
kitchen 0.1.0
PhyloMeasures 2.1
People involved with analysis: Avery Kruger
DATA-SPECIFIC INFORMATION FOR: angiocomm.rds
Number of variables: 9,850
Number of cases/rows: 5,000
Element List:
param: Evolutionary parameter used for evolution of traits
comm: Data frame of simulated communities. Contains 9,849 species binomial variables, which represent the presence or absence of a dicot species in each simulated community.
Missing data codes:
None
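A quick structural check after loading the file might look like the sketch below. It assumes the object returned by `readRDS()` exposes elements named `param` and `comm` as listed above; `check_angiocomm` is a hypothetical helper, and a small toy object is written to a temporary file so the example does not depend on the archived data.

```r
# Hypothetical sanity check for an object shaped like angiocomm.rds.
check_angiocomm <- function(x) {
  stopifnot(all(c("param", "comm") %in% names(x)))
  comm <- as.matrix(x$comm)
  stopifnot(length(x$param) == nrow(comm))   # one parameter per community
  stopifnot(all(comm %in% c(0, 1)))          # presence/absence only
  list(n_communities   = nrow(comm),
       n_species       = ncol(comm),
       community_sizes = rowSums(comm))
}

# Toy stand-in with three communities and three species.
toy <- list(
  param = c(-0.05, 0.02, 0.10),
  comm  = data.frame(sp_a = c(1, 0, 1),
                     sp_b = c(1, 1, 0),
                     sp_c = c(0, 1, 1))
)
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
info <- check_angiocomm(readRDS(path))

# With the archived file in place after setup, the call would be:
# info <- check_angiocomm(readRDS("data/angiosim/angiocomm.rds"))
```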
DATA-SPECIFIC INFORMATION FOR: angiompd.rds
Number of variables: 99
Number of cases/rows: 5,000
Variable List:
param: Evolutionary parameter used for evolution of traits
98 numeric variables 0.05, 0.1, ..., 40: Each numerically named variable represents the mean pairwise distance of each simulated community, calculated on a phylogeny transformed by a delta transformation where delta equals the variable name.
Missing data codes:
None
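The way a table shaped like angiompd.rds could be used to train and test a linear model is sketched below. The data are synthetic, the 80/20 split is an assumption, and only three of the numerically named columns are mimicked. Column names are passed through `make.names()` so they are safe to use in a formula.

```r
set.seed(7)
n <- 200
param <- rnorm(n, mean = 0, sd = 0.08)

# Synthetic stand-in for an MPD curve: each column depends linearly on
# param plus a little noise (an assumption made purely for illustration).
slopes <- c(1, 2, 3)
curve <- sapply(slopes, function(b) 10 + b * param + rnorm(n, sd = 0.005))
toy <- data.frame(param = param, curve, check.names = FALSE)
names(toy)[-1] <- c("0.05", "0.1", "40")   # numerically named columns
names(toy) <- make.names(names(toy))       # becomes X0.05, X0.1, X40

# Hold out 20% of rows for testing (the split proportion is assumed).
train_idx <- sample(n, size = 0.8 * n)
fit  <- lm(param ~ ., data = toy[train_idx, ])
pred <- predict(fit, newdata = toy[-train_idx, ])

# Agreement between predicted and known parameters on the test rows.
r <- cor(pred, toy$param[-train_idx])
```

For the archived file, `toy` would be replaced by the data frame read from `data/angiosim/angiompd.rds`.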
Methods
Data were simulated, processed, and analyzed in R. The analyses investigate how well two methods recover the evolutionary model of the traits on which communities are assembled: a machine learning technique termed Convolutional Kitchen Sinks (CKS), and models trained on a series of Mean Pairwise Distance (MPD) metrics, termed an MPD curve. Communities were simulated on both simulated and empirical phylogenies by evolving traits on the phylogenies according to an Early Burst transformation governed by a normally distributed evolutionary parameter. Data were separated into training and test data. The evolutionary parameters used in simulation were then modeled as a function of the simulated communities in the training data, using both the CKS method and the MPD method. The models were then tested by examining the relationship between the predicted and known parameters of the test data. Finally, the trained models were used to make predictions for the known community of freeze-tolerant dicots.
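For readers unfamiliar with the general idea behind CKS, the sketch below shows one generic form of it in base R: random one-dimensional convolutional filters slide over each presence/absence vector, pass through a rectifier, are averaged over positions, and feed a ridge regression. It does not use or reproduce the kitchen package API. `cks_features`, `ridge_fit`, the filter distribution, the window width, and the penalty are all assumptions chosen for illustration, and the response is random noise used only to exercise the code.

```r
set.seed(1)

# Random-convolution features for an n x p matrix X whose columns are
# assumed to be ordered as the tips of the phylogeny.
cks_features <- function(X, filters, bias) {
  w <- ncol(filters)                         # window width
  starts <- seq_len(ncol(X) - w + 1)         # every window position
  feats <- matrix(0, nrow(X), nrow(filters))
  for (s in starts) {
    window <- X[, s:(s + w - 1), drop = FALSE]
    act <- sweep(window %*% t(filters), 2, bias, "+")
    feats <- feats + pmax(act, 0)            # rectifier, summed over windows
  }
  feats / length(starts)                     # average pooling
}

# Ridge regression with an unpenalized intercept.
ridge_fit <- function(Phi, y, lambda) {
  Phi1 <- cbind(1, Phi)
  penalty <- diag(c(0, rep(lambda, ncol(Phi))))
  solve(crossprod(Phi1) + penalty, crossprod(Phi1, y))
}

# Toy inputs: 50 communities over 30 tips, 16 random filters of width 5.
n <- 50; p <- 30; k <- 16; w <- 5
X <- matrix(rbinom(n * p, 1, 0.4), n, p)
y <- rnorm(n)                                # placeholder response
filters <- matrix(rnorm(k * w), k, w)
bias <- rnorm(k)

Phi  <- cks_features(X, filters, bias)
beta <- ridge_fit(Phi, y, lambda = 1)
pred <- drop(cbind(1, Phi) %*% beta)
```

The filters are never trained. Only the final linear layer is fitted, which is what keeps this family of methods cheap compared with a trained convolutional network.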