Testing relationships between multiple regional features and biogeographic processes of speciation, extinction, and dispersal
Data files
Jul 10, 2023 version files (513.36 MB)
- data.zip (4.44 MB)
- output.zip (508.91 MB)
- README.md (11.33 KB)
Jul 23, 2024 version files (328.67 MB)
- code.zip (33.41 KB)
- data.zip (8.07 MB)
- output.zip (320.55 MB)
- README.md (12.69 KB)
Oct 17, 2024 version files (329.42 MB)
- code.zip (33.53 KB)
- data.zip (8.07 MB)
- output.zip (321.31 MB)
- README.md (12.80 KB)
Abstract
The spatial and environmental features of regions where clades are evolving are expected to impact biogeographic processes such as speciation, extinction, and dispersal. Any number of regional features (such as altitude, distance, area, etc.) may be directly or indirectly related to these processes. For example, it may be that distances or differences in altitude or both may limit dispersal rates. However, it is difficult to disentangle which features are most strongly related to rates of different processes. Here, we present an extensible Multi-feature Feature-Informed GeoSSE (MultiFIG) model that allows for the simultaneous investigation of any number of regional features. MultiFIG provides a conceptual framework for incorporating large numbers of features of different types, including categorical, quantitative, within-region, and between-region features, along with a mathematical framework for translating those features into biogeographic rates for statistical hypothesis testing. Using traditional Bayesian parameter estimation and reversible-jump Markov chain Monte Carlo, MultiFIG allows for the exploration of models with different numbers and combinations of feature-effect parameters, and generates estimates for the strengths of relationships between each regional feature and core process. We validate this model with a simulation study covering a range of scenarios with different numbers of regions, tree sizes, and feature values. We also demonstrate the application of MultiFIG with an empirical case study of the South American lizard genus Liolaemus, investigating sixteen regional features related to area, distance, and altitude. Our results show two important feature-process relationships: a negative distance/dispersal relationship, and a negative area/extinction relationship. Interestingly, although speciation rates were found to be higher in Andean versus non-Andean regions, the model did not assign significance to Andean- or altitude-related parameters. 
These results highlight the need to consider multiple regional features in biogeographic hypothesis testing.
README
Testing relationships between multiple regional features and biogeographic processes of speciation, extinction, and dispersal
This dataset is associated with Swiston & Landis 2023 (https://doi.org/10.1101/2023.06.19.545613). It contains two major elements. First is a simulation study performed using R and RevBayes associated with the validation of the MultiFIG model. Here, we provide simulated regional feature values, simulated phylogenetic trees and present-day species ranges, and output from analyses with the MultiFIG model. Second is an empirical analysis of the Liolaemus genus under the MultiFIG model. We also provide figures and supplemental figures associated with the manuscript.
Description of the data and file structure
Overview: The code .zip file contains all of the scripts necessary for generating and analyzing simulations for the simulation study (sim), running the empirical analysis of Liolaemus (emp), and plotting the results of both analyses (plotting). The data .zip file contains all of the simulated datasets for the simulation study (sim) and the empirical dataset for the analysis of Liolaemus (emp). The output .zip file contains all of the unprocessed and processed output from the simulation analyses (sim) and empirical analysis (emp), as well as all plots and supplemental figures from the associated manuscript (plots).
code
Contains R and RevBayes code for the MultiFIG simulation study and the empirical analysis of Liolaemus, as well as plotting scripts.
code/sim
Contains scripts for generating and analyzing simulated datasets under the MultiFIG model
- batchsim.sh: Shell script for submitting simulations using an LSF job scheduler (not required if simulating locally)
- sim.sh: Shell script that calls separate scripts for simulating geographies and trees under the MultiFIG model
- geosim.R: R script for simulating "geographies", generating .csv data files in data/sim/geo representing simulated regional features and feature summaries
- sim.Rev: RevBayes script for generating phylogenetic trees, tip states, and model parameters in data/sim/history according to the MultiFIG model (based on simulated geographies)
- batchinf.sh: Shell script for submitting inference jobs on simulated datasets using an LSF job scheduler (not required if performing inference locally)
- inf.sh: Shell script that calls inf.Rev on simulated datasets
- inf.Rev: RevBayes script for performing inference on simulated datasets under the MultiFIG model
code/emp
Contains RevBayes scripts for performing an analysis of Liolaemus under the MultiFIG model
- batchinf.sh: Shell script for submitting inference jobs on the Liolaemus dataset using an LSF job scheduler (not required if performing analysis locally)
- inf.sh: Shell script that calls inf.Rev on the Liolaemus dataset
- inf.Rev: RevBayes script for performing inference on the Liolaemus dataset under the MultiFIG model
code/plotting
Contains R scripts for plotting results of the simulation study and the empirical analysis of Liolaemus
- cov_plots.R: R script for generating coverage plots and the coverage table associated with the simulation study
- joint_plots.R: R script for generating joint posterior plots associated with the analysis of Liolaemus
- posterior_plots.R: R script for generating plots of Bayesian posteriors associated with the analysis of Liolaemus
- rj_sim_plots.R: R script for plotting reversible jump results associated with the simulation study
- state_plots.R: R script for plotting the ancestral state reconstruction for Liolaemus
- variance_plots.R: R script for plotting the variance of the empirical features against the variances of the simulated features
data
Contains data for the simulation study and the empirical analysis of Liolaemus
data/sim
Contains data associated with the simulation study
data/sim/geo
Contains data files for regional features over a set of simulated geographies, as well as summary files which describe the features used in each analysis (some analyses use different features in simulation versus inference)
- The first element of the filename represents whether the analysis will be performed using reversible-jump (RJ) or without (NONRJ)
- The second element of the filename represents the number of regions
- The third element represents the experimental condition of the analysis (all 12 features are generated, but some are not used)
  - LESS: area & distance during simulation; area, distance, & altitude during inference
  - FULL: area, distance, & altitude during simulation; area, distance, & altitude during inference
  - MORE: area, distance, altitude, & temperature during simulation; area, distance, & altitude during inference
  - NOISY: area, distance, & taltitude (true altitude) during simulation; area, distance, & altitude during inference
- The fourth element represents the tree size category: 25-49 = XSMALL, 50-99 = SMALL, 100-199 = MEDIUM, 200-349 = LARGE (this information is not used in simulating geographies, but will be used in simulating trees)
- The fifth element represents the index of the simulated geography (randomly assigned)
- SIM_feature_summary.csv files contain a list of features to be used in the associated simulation
- INF_feature_summary.csv files contain a list of features to be used in the associated inference
- For all feature files:
  - The fifth element represents the type of data (c=categorical, q=quantitative, w=within-region, b=between-regions)
  - The sixth element is the geographical feature that the data represents: area, distance, altitude, true altitude, or temperature
  - E.g., 3.FULL.MEDIUM.1.cb_distance.csv uses 3 regions, will use the feature set "FULL" when simulating a "MEDIUM" tree, is the simulated geography indexed 1, and contains a matrix of categorical distances between regions (an adjacency matrix)
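As an illustration of the naming scheme, the following Python sketch splits a feature filename into its documented parts. It follows the layout of the single example filename above (which carries no RJ/NONRJ prefix); the helper name `parse_feature_filename` and the `FeatureFile` container are hypothetical and are not part of the dataset's scripts.

```python
from dataclasses import dataclass

# Type codes listed in the README: c/q describe the values, w/b describe the scope.
TYPE_CODES = {
    "c": "categorical",
    "q": "quantitative",
    "w": "within-region",
    "b": "between-regions",
}


@dataclass
class FeatureFile:
    n_regions: int
    condition: str    # LESS, FULL, MORE, or NOISY
    tree_size: str    # XSMALL, SMALL, MEDIUM, or LARGE
    geo_index: int
    data_types: tuple
    feature: str


def parse_feature_filename(name: str) -> FeatureFile:
    """Hypothetical helper: split a feature filename into its documented elements."""
    stem = name[:-len(".csv")] if name.endswith(".csv") else name
    parts = stem.split(".")
    if len(parts) != 5:
        # Assumption: files with an RJ/NONRJ prefix would need an extra leading element.
        raise ValueError(f"unexpected filename layout: {name}")
    n_regions, condition, tree_size, geo_index, tail = parts
    # The last element combines the type codes and the feature name, e.g. "cb_distance".
    codes, _, feature = tail.partition("_")
    return FeatureFile(
        n_regions=int(n_regions),
        condition=condition,
        tree_size=tree_size,
        geo_index=int(geo_index),
        data_types=tuple(TYPE_CODES[code] for code in codes),
        feature=feature,
    )


parsed = parse_feature_filename("3.FULL.MEDIUM.1.cb_distance.csv")
print(parsed)
```

Feature names in the actual files (for example, how true altitude is spelled) should be checked against the files themselves before relying on a parser like this.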
data/sim/history
Contains data files for simulated trees, tip states, and model parameters
- E.g., RJ.3.FULL.MEDIUM.1.tree.tre contains a Newick-string representation of a dated phylogeny, simulated based on the 3-region geography indexed 1 using the feature set "FULL", and targeting a medium tree size, to be analyzed using reversible jump
- E.g., RJ.3.FULL.MEDIUM.1.data.tsv contains tip states corresponding to the RJ.3.FULL.MEDIUM.1 simulation
- E.g., RJ.3.FULL.MEDIUM.1.param.txt contains MultiFIG model parameters that produced the RJ.3.FULL.MEDIUM.1 simulation
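The three history files of one simulation share a common identifier, so they can be grouped programmatically. The Python sketch below parses such an identifier, lists the expected filenames, and checks a tree against its size category. The function names are hypothetical, and reading the size ranges as counts of tips is an assumption (the README gives only the numeric ranges).

```python
# Size ranges for each tree size category, as listed in the README.
# Assumption: the ranges count the tips (species) of the simulated tree.
SIZE_RANGES = {
    "XSMALL": (25, 49),
    "SMALL": (50, 99),
    "MEDIUM": (100, 199),
    "LARGE": (200, 349),
}

# File suffixes used by the three example history files.
HISTORY_SUFFIXES = ("tree.tre", "data.tsv", "param.txt")


def parse_simulation_id(sim_id: str) -> dict:
    """Split an identifier such as 'RJ.3.FULL.MEDIUM.1' into named fields."""
    method, n_regions, condition, tree_size, index = sim_id.split(".")
    if method not in ("RJ", "NONRJ"):
        raise ValueError(f"unknown analysis type: {method}")
    if tree_size not in SIZE_RANGES:
        raise ValueError(f"unknown tree size category: {tree_size}")
    return {
        "method": method,
        "n_regions": int(n_regions),
        "condition": condition,
        "tree_size": tree_size,
        "index": int(index),
    }


def history_files(sim_id: str) -> list:
    """Names of the tree, tip-state, and parameter files for one simulation."""
    return [f"{sim_id}.{suffix}" for suffix in HISTORY_SUFFIXES]


def size_matches(tree_size: str, n_tips: int) -> bool:
    """Check whether a tip count falls inside the named size category."""
    low, high = SIZE_RANGES[tree_size]
    return low <= n_tips <= high


info = parse_simulation_id("RJ.3.FULL.MEDIUM.1")
print(info)
print(history_files("RJ.3.FULL.MEDIUM.1"))
print(size_matches(info["tree_size"], 150))
```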
data/emp
Contains the empirical dataset associated with Liolaemus and 6 South American regions (AA = Altiplanic Andes, CA = Central Andes, PA = Patagonia, CC = Central Chile, AD = Atacama Desert, EL = Eastern Lowland)
data/emp/history
Contains the empirical dataset relating to Liolaemus
- liolaemidae.data.full.csv: Contains information about each species in the family Liolaemidae (including the genus Liolaemus), such as presence/absence in different regions, as well as other species traits that are not used for the MultiFIG analysis (data from Esquerré et al. 2019)
- make_dat.py: A Python script for translating the data from liolaemidae.data.full.csv into data that is usable for the MultiFIG model (liolaemidae.data.table.tsv and ranges.data.tsv)
- state_labels_n6.txt: Relates binary presence/absence data to integer state numbers used by RevBayes for the MultiFIG analysis; used by the make_dat.py script
- liolaemidae.data.table.tsv: Contains present-day ranges of Liolaemus, representing presence in a region with (1) and absence with (0)
- ranges.data.tsv: Contains state numbers for present-day Liolaemus species
- tree.mcc.tre: Time-calibrated phylogeny of Liolaemus
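To make the relationship between presence/absence ranges and integer state numbers concrete, here is a minimal Python sketch. The ordering it uses (ranges sorted by number of occupied regions, numbered from 0) and the region column order are assumptions for illustration only; the mapping actually used in the analysis is the one stored in state_labels_n6.txt, and make_dat.py may work differently.

```python
from itertools import combinations

# Region codes from the README; their column order here is an assumption.
REGIONS = ["AA", "CA", "PA", "CC", "AD", "EL"]


def build_state_labels(n_regions: int) -> dict:
    """Map each non-empty presence/absence string to an integer state.

    Assumed ordering: ranges with fewer occupied regions come first, and
    states are numbered from 0. The dataset's own label file is authoritative.
    """
    labels = []
    for size in range(1, n_regions + 1):
        for occupied in combinations(range(n_regions), size):
            bits = ["0"] * n_regions
            for region in occupied:
                bits[region] = "1"
            labels.append("".join(bits))
    return {label: state for state, label in enumerate(labels)}


def range_to_state(presence: dict, labels: dict) -> int:
    """Convert a {region: 0/1} record into its integer state."""
    key = "".join(str(presence[region]) for region in REGIONS)
    return labels[key]


labels = build_state_labels(len(REGIONS))
# Hypothetical species occupying only the first two regions.
example = {"AA": 1, "CA": 1, "PA": 0, "CC": 0, "AD": 0, "EL": 0}
print(len(labels), range_to_state(example, labels))
```

With 6 regions there are 2^6 - 1 = 63 non-empty ranges, which is why the state space grows quickly as regions are added.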
data/emp/geo
Contains the empirical dataset relating to South America
- shapefiles: Contains shapefiles associated with the 6 regions used in the Liolaemus analysis; these are not required for running the analysis
- RJ.6.HL.LIOLAEMUS.feature_summary.csv: Contains information about which features to use for the empirical analysis using highland/lowland classification
data/emp/geo/altitudes
Contains regional features associated with altitude
- andean_classification.csv: Classification of regions as Andean or non-Andean
- andean_sameness.csv: Matrix describing whether regions share (1) or do not share (0) Andean classification
- classification.csv: Altitude classification of regions (1=high, 0=low)
- mean.csv: Mean altitudes of regions (m)
- mean_diff.csv: Differences in mean altitude between regions (m)
- sameness.csv: Matrix describing whether regions share (1) or do not share (0) altitude classification
- sd.csv: Standard deviation of altitudes of regions
data/emp/geo/areas
Contains regional features associated with area
- areas.csv: Sizes of regions (km^2)
- classification.csv: Size classification of regions (1=large, 0=small)
data/emp/geo/distances
Contains regional features associated with distance
- mean.csv: Mean distances between regions (km)
- adjacency.csv: Matrix of region adjacency
data/emp/geo/equal
Contains vectors/matrices of equal feature values for simplified analyses (removing feature effects)
- E.g., q_equal_vector.csv: A vector of equal feature values for simplified analysis (in this case, a quantitative one-dimensional vector)
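Several of the between-region files are described as functions of within-region values (sameness matrices from classifications, difference matrices from means). The Python sketch below shows one way such matrices could be derived. The input values are invented for three hypothetical regions, and using absolute differences is an assumption, since the README does not say whether mean_diff.csv is signed.

```python
def sameness_matrix(classification):
    """1 where two regions share a classification, 0 where they do not."""
    return [[1 if a == b else 0 for b in classification] for a in classification]


def difference_matrix(values):
    """Pairwise absolute differences (assumption: unsigned differences)."""
    return [[abs(a - b) for b in values] for a in values]


# Hypothetical values for three regions, not taken from the dataset.
classification = [1, 1, 0]           # 1=high, 0=low
mean_altitude = [3000.0, 2500.0, 400.0]  # metres

same = sameness_matrix(classification)
diff = difference_matrix(mean_altitude)

for row in same:
    print(row)
for row in diff:
    print(row)
```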
output
Contains output of simulated and empirical analyses
output/sim
Contains output of the simulation study
- output: Output of simulation analyses (logfiles for model parameters)
- data: Summaries of output for each analysis (estimates and HPD intervals)
- processed_data: Contains the large output file of the coverage analysis, coverages.csv
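For readers unfamiliar with coverage analyses, coverage is commonly computed as the fraction of simulated analyses whose HPD interval contains the true simulating value. The Python sketch below illustrates that calculation on invented numbers; the column layout of coverages.csv is not described here, so the tuple structure is an assumption rather than the dataset's format.

```python
def coverage(records):
    """Fraction of analyses whose HPD interval contains the true value.

    Each record is (true_value, hpd_lower, hpd_upper); this layout is
    hypothetical and chosen only for the example.
    """
    hits = sum(1 for true, lower, upper in records if lower <= true <= upper)
    return hits / len(records)


# Invented records for four analyses of a single parameter.
records = [
    (0.5, 0.2, 0.9),    # covered
    (1.2, 0.1, 1.0),    # not covered
    (-0.3, -0.8, 0.1),  # covered
    (0.0, -0.4, 0.4),   # covered
]

result = coverage(records)
print(result)
```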
output/emp
Contains output of the empirical analysis of Liolaemus, including 6 file types: .ase.tre (ancestral state tree), .states.log (ancestral state trace), .stoch.log (stochastic mapping), .events.tsv (list of state transitions and cladogenetic events), .model_extras.log (by-region rates of biogeographic processes), and .model.log (logfile of model parameters)
- concatenated.model.log: Logfile of model parameters used for results in the manuscript; created by concatenating the output of analyses 8 and 9 (numbering was arbitrary) after removing burn-in, to ensure a sufficient number of generations
- output: Contains other output files associated with analyses 8 and 9, including the ancestral state reconstruction generated from analysis 9
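The concatenation step can be sketched in Python as follows. This is not the authors' procedure: the burn-in fraction, the column names, and the helper names are all placeholders, and the only assumption about the logs is that they are tab-delimited with a shared header row.

```python
import csv
import io


def drop_burnin(rows, fraction):
    """Discard the first `fraction` of the sampled rows."""
    n_drop = int(len(rows) * fraction)
    return rows[n_drop:]


def concatenate_logs(log_texts, burnin_fraction):
    """Combine tab-delimited logs that share a header, after removing burn-in."""
    header = None
    kept = []
    for text in log_texts:
        rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
        if header is None:
            header = rows[0]
        elif rows[0] != header:
            raise ValueError("log files have different columns")
        kept.extend(drop_burnin(rows[1:], burnin_fraction))
    return [header] + kept


# Two tiny invented logs; the column names are placeholders.
log_a = "Iteration\tPosterior\n0\t-10.0\n1\t-9.0\n2\t-8.5\n3\t-8.4\n"
log_b = "Iteration\tPosterior\n0\t-11.0\n1\t-9.2\n2\t-8.6\n3\t-8.3\n"

# 25% burn-in is a placeholder, not the value used for the manuscript.
combined = concatenate_logs([log_a, log_b], burnin_fraction=0.25)
for row in combined:
    print("\t".join(row))
```

Note that iteration numbers repeat across the combined runs, so downstream tools should treat the rows as pooled posterior samples rather than as one continuous chain.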
output/plots
Contains plots associated with Swiston & Landis 2023 (https://doi.org/10.1101/2023.06.19.545613)
Sharing/Access information
Data was derived from the following sources:
- Esquerré, D., Brennan, I. G., Catullo, R. A., Torres-Pérez, F., & Keogh, J. S. (2019). How mountains shape biodiversity: The role of the Andes in biogeography, diversification, and reproductive biology in South America’s most species-rich lizard radiation (Squamata: Liolaemidae). Evolution, 73(2), 214–230.
- NASA. (2013). Shuttle Radar Topography Mission (SRTM) Global [Data set]. https://doi.org/10.5069/G9445JDF
- Swiston, S. K., & Landis, M. J. (2023). Testing relationships between multiple regional features and biogeographic processes of speciation, extinction, and dispersal. BioRxiv. https://doi.org/10.1101/2023.06.19.545613
Code/Software
- The dataset contains .R files for generating simulated geographies and plotting output. These files were designed to be run using R version 4.4.0.
- The dataset also contains .Rev files for performing analyses in RevBayes. Correct versions of RevBayes and TensorPhylo can be found in the Docker image sswiston/rb_tp:7 (https://hub.docker.com/r/sswiston/rb_tp).
For a tutorial explaining the details of the MultiFIG analysis, visit https://revbayes.github.io/tutorials/multifig/.
Version Changes:
- 2024/07/22: Due to changes in RevBayes and TensorPhylo software, all scripts were overhauled, all analyses re-run, and all figures re-generated. New .zip files have been uploaded.
- 2024/10/17: The term 'altitude' was changed to 'elevation' throughout the plotting scripts and figures.
Methods
This dataset contains a simulation study performed using R and RevBayes. It consists of simulated regional feature values, simulated phylogenetic trees and present-day species ranges, and output from analyses with the MultiFIG model. The dataset also contains files relevant to a MultiFIG analysis of Liolaemus. Finally, the dataset contains PDF versions of figures and supplemental figures.
Usage notes
The data files can be opened with any text editor. For visualization, trees and logfiles can be opened in Tracer. Plots can be opened by any PDF reader.