Data from: Avoiding impacts of phylogenetic tip-state-errors on dispersal and extirpation rates in alpine plant biogeography
Data files
Sep 30, 2024 version files 330.15 MB
- data.zip (79.95 KB)
- output.zip (330.02 MB)
- README.md (3.18 KB)
- scripts.zip (46.60 KB)
Abstract
Aim: Many biogeographic analyses require some form of automated state assignment to tips of phylogenetic trees, reflecting a species' presence or absence in a particular area, e.g., a biome. As datasets get exponentially larger, such procedures may increasingly induce errors (here called tip-state-error), but the specific algorithmic cause and consequence on downstream estimation of dispersal and extinction rates remain poorly known. We aim to improve automated tip-scoring methods in the context of the alpine biome by leveraging elevation information. We document the profound effect of tip-state-errors on Dispersal-Extirpation-Cladogenesis (DEC) models.
Location: The European Alpine Arc.
Taxon: 3’317 vascular plant species, emphasizing six focal clades: Campanula, Carex, Festuca, Ranunculus, Saxifraga, and Viola.
Methods: We use GBIF data to classify whether species occur above the upper climatic treeline using a newly developed algorithm ElevDistr or a gridded landscape model of thermal belts, under various filtering thresholds. We compared classification performance using the Flora Alpina as validation data. To determine if tip-state-error biases the dispersal and extirpation rate estimation, we fit DEC models for selected clades using tip-states from different classification models.
Results: ElevDistr is less error prone than other approaches. Filtering thresholds lower the false positive rate but increase the false negative rate. Inflated false positive rates bias the dispersal rate estimation upward, while inflated false negative rates lead to upward bias in extirpation rate estimation.
Main conclusions: Even moderate tip-state-error may lead to profound systematic bias in dispersal and extinction rate estimation if an unbalanced ratio between false positive and false negative rates occurs. Therefore, careful validation is imperative, though ElevDistr alleviates this problem in the context of the alpine environment. Overall, our results document contrasting rates of alpine biome shifts across the studied genera and have major implications for studies addressing the likelihood of niche evolution versus geographic dispersal.
This data archive contains the input data, the scripts, and the output data needed to run 24 Dispersal-Extirpation-Cladogenesis (DEC) models in RevBayes. A detailed description of how the GBIF data were processed to obtain the input data can be found in sections 2.2-2.4 of the manuscript, and the processing of the final log files is described in section 2.6.
Description of the data and file structure
data.zip - This archive contains the directory holding all the input data. It contains the following elements:
- a_classification.csv: contains the classifications of all 3317 species. The columns give the species, the number of available occurrence observations, and the proportion of species classified as alpine or non-alpine by the Flora Alpina, followed by the proportion of species classified as alpine or non-alpine by the polygon method or ElevDistr.
- b_oneTwoTree_output: this folder contains, for each of the six genera, the phylogeny from the oneTwoTree pipeline.
- The files GENUS_METHOD.nex are the input files for the DEC model generated by hand from a_classification.csv.
scripts.zip - This archive contains the scripts used for processing the data. It contains the following elements:
- 1_date_tree.r: An R script used to convert the phylogram (oneTwoTree output) into a chronogram. The script takes the files data/b_oneTwoTree_output/GENUS/Result_Tree_1XXXXXXXXX.tre as input and generates the output files output/GENUS.tre.
- 2_array.sh: A bash script that executes an array job. The script is written to run on the sciCORE high-performance cluster of the University of Basel. It runs the XX_simple_run.Rev scripts as an array.
- The files XX_simple_run.Rev each contain the code to run one DEC model. Together they cover every combination of the six genera and the four classification methods. Input files are data/GENUS_METHOD.nex and output/GENUS.tre. The MCMC results are stored in output/GENUS_METHOD.params.log.
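The file naming scheme above implies one run per genus and classification method. As a minimal sketch (not part of the archived scripts), the following Python snippet enumerates the expected input and output paths. The genus names are taken from the abstract; the METHOD labels are placeholders, because the actual strings used in the file names are not listed here, and the capitalisation of GENUS in the real file names is likewise an assumption.

```python
from itertools import product

# Genus names as given in the abstract; the spelling used in the actual
# file names is an assumption.
GENERA = ["Campanula", "Carex", "Festuca", "Ranunculus", "Saxifraga", "Viola"]

# Placeholder labels (hypothetical): replace with the METHOD strings that
# appear in the GENUS_METHOD.nex file names.
METHODS = ["method1", "method2", "method3", "method4"]


def run_table(genera=GENERA, methods=METHODS):
    """Return one dict of expected file paths per genus/method combination."""
    rows = []
    for genus, method in product(genera, methods):
        rows.append(
            {
                "genus": genus,
                "method": method,
                "tip_states": f"data/{genus}_{method}.nex",
                "tree": f"output/{genus}.tre",
                "log": f"output/{genus}_{method}.params.log",
            }
        )
    return rows


if __name__ == "__main__":
    for row in run_table():
        print(row["tip_states"], row["tree"], row["log"])
```

Six genera and four methods give the 24 DEC models mentioned above; a table like this can be used to check that every expected file is present after unpacking the archives.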
output.zip - This archive contains the directory holding all the output data. It contains the following elements:
- GENUS.tre: Contains the time-calibrated phylogeny of each genus.
- GENUS_METHOD.params.log: These files contain the DEC model output from RevBayes. Every log file records, for each sampled iteration of the MCMC run, the posterior, likelihood, prior, extirpation rate, and dispersal rate.
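A params.log file of this kind can be summarised with a few lines of Python. The sketch below assumes the log is a tab-separated table with one header row, which is the usual layout of RevBayes parameter logs. The column names in the usage example and the 25% burn-in are illustrative assumptions, not the settings used in the manuscript (see section 2.6 there for the actual processing).

```python
import csv
import io
from statistics import mean


def summarize_log(handle, columns, burnin_fraction=0.25):
    """Return the post-burn-in mean of each requested column.

    handle: open text file (or any iterable of lines) holding a
            tab-separated log with a header row.
    columns: column names to summarise (must match the header).
    burnin_fraction: share of the first samples to discard (an assumption).
    """
    rows = list(csv.DictReader(handle, delimiter="\t"))
    start = int(len(rows) * burnin_fraction)
    kept = rows[start:]
    return {name: mean(float(row[name]) for row in kept) for name in columns}


if __name__ == "__main__":
    # Tiny in-memory stand-in for a log file; the column names are hypothetical.
    sample = io.StringIO(
        "Iteration\trate_dispersal\trate_extirpation\n"
        "0\t0.5\t4.0\n"
        "1\t1.0\t3.0\n"
        "2\t1.5\t2.0\n"
        "3\t2.0\t1.0\n"
    )
    print(summarize_log(sample, ["rate_dispersal", "rate_extirpation"]))
```

To use it on a real file, open output/GENUS_METHOD.params.log with `open(path, newline="")` and pass the column names found in its header.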
Sharing/Access information
Here I link to the data sources or the pipeline to collect the data
Links to other publicly accessible locations of the data:
- “Raw” GBIF data: https://www.gbif.org/occurrence/download/0262521-210914110416597
Data was derived from the following sources:
- Flora Alpina: Aeschimann, D., Lauber, K., Moser, D. M., & Theurillat, J. P. (2004). Flora alpina: atlas des 4500 plantes vasculaires des Alpes. Belin.
- OneTwoTree: Drori, M., Rice, A., Einhorn, M., Chay, O., Glick, L., & Mayrose, I. (2018). OneTwoTree: An online tool for phylogeny reconstruction. Molecular Ecology Resources, 18(6), 1492-1499.