Speciation, dispersal, and the build-up of fern diversity in the American tropics
Data files
Sep 22, 2025 version files (2.44 GB)
- Biogeographic_analyses_-_BioGeoBEARS.zip (2.44 GB)
- README.md (23.05 KB)
Abstract
This dataset supports an investigation into the temporal dynamics of speciation and dispersal in ferns (Polypodiopsida) across the American tropics. It includes 378,115 georeferenced occurrence records compiled from GBIF, cleaned and taxonomically standardised using the 'CoordinateCleaner' and 'taxastand' R packages. Species were matched to a time-calibrated phylogenetic tree comprising 5850 fern species, from which 56 clades predominantly distributed in the American tropics were retrieved. These clades encompass 1530 species coded for presence/absence across nine biogeographically defined regions and a global category. Using biogeographical stochastic mapping (BSM) in BioGeoBEARS, we estimated the frequency and type of biogeographic events (e.g., dispersal, extinction, speciation) across 1-million-year time slices. The dataset also includes lineage-through-time (LTT) counts, regional species pool compositions at four time points (0, 5, 10, and 30 Ma), and pairwise compositional similarities calculated using the Jaccard index. These data provide a foundation for analysing historical biogeographic transitions and diversification patterns in tropical ferns, with emphasis on regional connectivity and the role of the Andes as a biodiversity source.
This DRYAD repository contains scripts for reconstructing the biogeographical history of ferns in the American tropics and output data, as presented in the paper "Speciation, dispersal and the build-up of fern diversity in the American tropics" in the scientific journal Ecography.
Before you start
To run our analyses, we used a combination of RStudio on a PC and a High Performance Computer (HPC). We recommend using an HPC because the analyses are computationally intensive. The HPC we used was set up with the Slurm Workload Manager.
Installation and startup
You need to install some programs on the HPC. We used the package manager Conda for installation.
- RStudio version 4.1.3 (RStudio Team 2023; http://www.rstudio.com/)
- BioGeoBEARS (Matzke 2013; https://github.com/nmatzke/BioGeoBEARS?tab=readme-ov-file)
- Note: BioGeoBEARS is not available through Conda. Follow this guide for installation: https://groups.google.com/g/biogeobears/c/AYNPfA202SI?pli=1, specifically Joel Nitta's comments.
To run step 2 of the data analysis (02_prepare_data.R), you need to install the dependencies of the R package taxastand. See https://github.com/joelnitta/taxastand?tab=readme-ov-file.
Data analyses
Description of the data and file structure
The folder structure of Data_analyses.zip is as follows:
Data_analyses.zip
├── Data generation
│ ├── 01_prepare_tree.R
│ ├── 02_prepare_data.R
│ ├── 03_map_occ.records.R
│ ├── 04_cut_tree.R
│ ├── 05_prepare_BioGeoBEARS.R
│ └── Shape_Morrone_Adapt_Andes_1000m.gpkg
├── Post analysis processing
│ └── 07_post_analysis_processing_final.R
└── Historical biogeography
  ├── 06_BioGeoBEARS_analysis_DEC+J_null_new_BSM_TS_array.sh
  ├── 06_BioGeoBEARS_analysis_DEC+J_null_TS_new_BSM_array.R
  ├── BioGeoBEARS_posterior_trees_count_events_v37_SLURM.R
  └── LTT.sh
Data generation
This folder contains scripts and files for preparing trees, retrieving and cleaning occurrence records, and mapping occurrences to trees.
To map occurrence records to our desired biogeographical regions, we redefined the regions of Morrone et al. (2022) using QGIS 3.40.6 'Bratislava', Long Term Release (LTR; https://www.qgis.org/).
Scripts are run through RStudio and can be executed on a standard PC.
To run the following scripts, create a folder R and set your working directory to ~R/. Within the folder R, create a folder Output and a folder Input.
1. Prepare tree
The script 01_prepare_tree.R uses the R package ftolr to retrieve the plastid phylogenetic tree published by Nitta et al. (2022), which, as of April 2025, represents the most comprehensive phylogenetic tree of ferns (Polypodiopsida). The script verifies that the tree is ultrametric and prunes non-fern outgroup taxa to focus exclusively on ferns.
2. Prepare data
The script 02_prepare_data.R retrieves occurrence data from the Global Biodiversity Information Facility (GBIF) using the occurrence download listed under GBIF in the References. It then cleans the occurrence records automatically with the R package CoordinateCleaner (Zizka et al., 2019) and matches the taxonomy using the R packages pteridocat (FTOL Working Group 2022) and taxastand (Nitta, 2022).
The following files are produced by the script but are not included in this repository because of the data providers' licensing.
- GBIF_clean.xlsx: This file contains all the occurrence records that remain after data cleaning.
- GBIF_flagged.xlsx: This file contains all the occurrence records that were flagged during data cleaning.
- fern_data_complete.xlsx: This file contains the cleaned and taxonomy-resolved data.
- fern_data_compressed.xlsx: This file contains a simple version of the cleaned and taxonomy-resolved data, only including the required columns: tree_name, decimalLatitude, and decimalLongitude.
3. Map occurrence records to bioregions
The script 03_map_occ.records.R matches the GBIF occurrence records to the nine biogeographical regions in the shapefile Shape_Morrone_Adapt_Andes_1000m.gpkg, which is an adapted version of the shapefile published by Morrone et al. (2022). A species is assigned to a biogeographical region only if at least 5% of its occurrence records fall within that region. Finally, the regional distributions derived from the occurrence records are matched to the species found in the phylogenetic tree.
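As an illustration of the 5% rule only, the following is a minimal Python sketch and not the code in 03_map_occ.records.R (which is an R script). It assumes that each occurrence record has already been given a region letter (or None when it falls outside every region) and that the 5% is taken over all records of the species; both are assumptions. The function name, the species name, and the record counts are hypothetical.

```python
from collections import Counter, defaultdict

MIN_FRACTION = 0.05  # the 5% threshold stated in the text


def assign_regions(records, min_fraction=MIN_FRACTION):
    """Map each species to the regions holding >= min_fraction of its records.

    `records` is an iterable of (species, region) pairs, one pair per
    occurrence record; region is None for a record outside all regions.
    Assumption: the denominator is the total number of records of the species.
    """
    per_species = defaultdict(Counter)
    for species, region in records:
        per_species[species][region] += 1

    assigned = {}
    for species, counts in per_species.items():
        total = sum(counts.values())
        assigned[species] = sorted(
            region
            for region, n in counts.items()
            if region is not None and n / total >= min_fraction
        )
    return assigned


# Hypothetical example: 100 records of one made-up species.
records = (
    [("Species_x", "I")] * 90   # 90% of records in the Andes (I)
    + [("Species_x", "D")] * 6  # 6% in Amazonia (D): kept
    + [("Species_x", "B")] * 4  # 4% in Mesoamerica (B): below 5%, dropped
)
result = assign_regions(records)
print(result)
```

In this toy case the species is coded as present in D and I but not in B, because only 4% of its records fall in B.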
4. Cut the tree into monophyletic clades
The script 04_cut_tree.R cuts the phylogenetic tree into clades of 9-100 species based on an 85% 'monophyly' threshold, i.e., at least 85% of the species in a clade must be assigned to the American tropics (falling within the shapefile Shape_Morrone_Adapt_Andes_1000m.gpkg). To be able to run this script, you need to download the script CladeByTrait.R and the function getDescend.R. These files are part of the R package SpeciesGeoCoder (Töpel et al., 2017), which has been defunct since September 2023.
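The selection criterion can be sketched as follows. This is a hedged Python illustration, not the CladeByTrait.R algorithm: it assumes a simple top-down search that keeps the largest clade meeting the size limits (9-100 tips) and the 85% threshold and otherwise descends into the child clades. The nested-tuple tree, the tip names, and the function names are hypothetical.

```python
def tips(node):
    """Return the tip names under a node (a tip is a str, a clade a tuple)."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]


def qualifies(tip_names, present, min_tips, max_tips, min_fraction):
    """True if the clade size is within bounds and enough tips are 'present'."""
    n = len(tip_names)
    if not (min_tips <= n <= max_tips):
        return False
    return sum(present[t] for t in tip_names) / n >= min_fraction


def select_clades(node, present, min_tips=9, max_tips=100, min_fraction=0.85):
    """Top-down search: keep a clade if it qualifies, else try its children."""
    names = tips(node)
    if qualifies(names, present, min_tips, max_tips, min_fraction):
        return [names]
    if isinstance(node, str):
        return []
    selected = []
    for child in node:
        selected.extend(
            select_clades(child, present, min_tips, max_tips, min_fraction)
        )
    return selected


# Toy tree with made-up tips; 1 = present in the American tropics, 0 = absent.
tree = (("a", "b", "c"), ("d", ("e", "f")))
present = {"a": 1, "b": 1, "c": 1, "d": 0, "e": 1, "f": 1}

# Size limits are shrunk to 2-4 tips so the toy tree can be split.
clades = select_clades(tree, present, min_tips=2, max_tips=4)
print(clades)
```

With the thresholds used in the README (9-100 species, 85%), a clade of 10 species with 9 tropical American members would qualify (90%), whereas one with 8 would not (80%).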
5. Prepare BioGeoBEARS file
The script 05_prepare_BioGeoBEARS.R prepares the geography files for BioGeoBEARS by matching the geographic distribution file with the phylogenetic tree clades. We also made some manual edits and dropped tips representing hybrids or species at the 'base' of the tree that are not tropical American. The final tree and geography files can be found in the folder 5. Prepare BioGeoBEARS file; see the section Data cleaning and preparation - Trees and occurrence records.
Historical biogeography
Scripts in this folder should be run on an HPC.
- Note our scripts are set up for an HPC using Slurm Workload Manager.
The required input files for the two BioGeoBEARS scripts are found in the folder Data cleaning and preparation - Trees and occurrence records, sub-folder 5. Prepare BioGeoBEARS file.
6. Run BioGeoBEARS
- Script 06_BioGeoBEARS_analysis_DEC+J_null_TS_BSM_array.R includes adapted code for running the BioGeoBEARS Biogeographic Stochastic Mapping (BSM) analysis published by Matzke (2016), using the R package BioGeoBEARS by Matzke (2018). Script 06_BioGeoBEARS_analysis_DEC+J_null_new_BSM_TS_array.sh includes the Slurm job settings.
- Script BioGeoBEARS_posterior_trees_count_events_v37_SLURM.R, developed and published by Magalhaes et al. (2021), runs the Lineage Through Time (LTT) analysis using the R package BioGeoBEARS. Script LTT.sh includes the Slurm job settings.
Post analysis processing
- Script 07_post_analysis_processing_final.R includes code for generating the manuscript analyses and figures and can be run on a standard PC.
To run the following scripts, you should set your working directory to ~R/Results/TS/null_new/ or ~R/Results/TS/LTT/ depending on your path. The script will indicate whether the path should be set to folder LTT or null_new.
7. Generate manuscript figures
The script 07_post_analysis_processing_final.R produces all figures and supplementary figures.
Data cleaning and preparation - Trees and occurrence records
Word and symbol meaning:
*: Numbers correspond to clades 1-56.
Bioregions: (A) Venezuela, (B) Mesoamerica, (C) Guianas, (D) Amazonia, (E) Antilles, (F) Chaco, (G) Chocó, (H) Paraná, (I) Andes, (J) non-American tropics.
Description of the data and file structure
The folder structure for Data cleaning and preparation - Trees and occurrence records.zip is as follows:
Data cleaning and preparation - Trees and occurrence records.zip (* 1-56)
├── 1. Prepare tree
│ ├── _tips.txt
│ ├── Nitta_raw_datedTree.pdf
│ ├── polypodiopsida_tips.txt
│ ├── polypodiopsida_tree.newick
│ └── polypodiopsida_tree.nexus
├── 2. Prepare data
│ ├── fern_data_complete.xlsx (not included)
│ ├── fern_data_compressed.xlsx (not included)
│ ├── GBIF_clean.xlsx (not included)
│ └── GBIF_flagged.xlsx (not included)
├── 3. Map occurrence records to bioregions
│ ├── ferns_classified_neotropics.txt
│ ├── ferns_classified.txt
│ ├── prepBioGeoBEARS.txt
│ └── species_binary.txt
├── 4. Cut the tree into monophyletic clades
│ ├── CladesByTrait_log.txt
│ ├── ferns_area_classification_used.txt
│ ├── ferns_phylogeny+traitdata.pdf
│ └── fernsclade_*.tre
└── 5. Prepare BioGeoBEARS file
  ├── BioGeoBEARS geography files
  │ └── filtered_prepBioGeoBEARS_*.txt
  └── BioGeoBEARS tree files
    └── filtered_tree_*.tre
The archive contains one sub-folder for each script in Data generation, named 1. Prepare tree etc., corresponding to scripts 1-5.
The intermediate data output for 5. Prepare BioGeoBEARS file has not been uploaded because this step also involves some manual edits. Instead, see the sub-folders BioGeoBEARS geography files and BioGeoBEARS tree files in 5. Prepare BioGeoBEARS file, where the final trees and geography files required by BioGeoBEARS are available.
- The 1. Prepare tree folder contains the output from running script 01_prepare_tree.R.
  - _tips.txt: This file contains the tip names of the raw phylogenetic tree by Nitta et al. (2022).
  - polypodiopsida_tips.txt: This file contains the tip names of the filtered phylogenetic tree of Nitta et al. (2022), only including Polypodiopsida.
  - Nitta_raw_datedTree.pdf: PDF of the raw dated Nitta et al. (2022) phylogenetic tree. Blue labels are node numbers and yellow labels are tip numbers.
  - polypodiopsida_tree.newick: This file contains the filtered phylogenetic tree of Nitta et al. (2022), only including Polypodiopsida, in newick format.
  - polypodiopsida_tree.nexus: This file contains the filtered phylogenetic tree of Nitta et al. (2022), only including Polypodiopsida, in nexus format.
- The 2. Prepare data folder is empty because of the data providers' licensing. Run the script 02_prepare_data.R to produce the files.
- The 3. Map occurrence records to bioregions folder contains the output from running script 03_map_occ.records.R.
  - .txt: These files contain species names and the species presence or absence in bioregions, coded as 1 and 0, respectively.
  - ferns_classified.txt: This file includes all the species classified according to the defined bioregions.
  - ferns_classified_neotropics.txt: This file includes only the species present in the defined bioregions and classified accordingly.
  - prepBioGeoBEARS.txt: Species presence or absence in any of the defined bioregions.
  - species_binary.txt: This file is used to cut the filtered phylogenetic tree into smaller, primarily tropical American clades based on presence or absence in Shape_Morrone_Adapt_Andes_1000m.gpkg.
- The 4. Cut the tree into monophyletic clades folder contains the output from running script 04_cut_tree.R.
  - CladesByTrait_log.txt: Log file from running the script.
  - ferns_area_classification_used.txt: This file contains the tree tips and information on presence (1) or absence (0) in the shapefile Shape_Morrone_Adapt_Andes_1000m.gpkg.
  - ferns_phylogeny+traitdata.pdf: PDF showing a total overview of species presence and absence, including the clades produced according to an 85% monophyly threshold.
  - fernsclade_*.tre: These files contain monophyletic clades where at least 85% of the species are present in at least one of the bioregions.
    - Input for running scripts 06_BioGeoBEARS_analysis_DEC+J_null_TS_BSM_array.R and BioGeoBEARS_posterior_trees_count_events_v37_SLURM.R.
- The 5. Prepare BioGeoBEARS file folder contains the output from running script 05_prepare_BioGeoBEARS.R.
  - filtered_prepBioGeoBEARS_*.txt in the subfolder BioGeoBEARS geography files. These files contain the species bioregion classification with presence (1) or absence (0) for each clade.
  - filtered_tree_*.tre in the subfolder BioGeoBEARS tree files. These files contain the phylogenetic trees for each clade.
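To show how such a presence/absence file can be read, the following Python sketch parses a geography file. It assumes the usual BioGeoBEARS phylip-style layout (this README describes the geography files as being in phylip format): a header line giving the number of taxa, the number of areas and the area letters in parentheses, followed by one line per species with a string of 0s and 1s. The layout is an assumption about these particular files; the species names, the four-area example, and the function name are hypothetical.

```python
def parse_geography(text):
    """Parse a phylip-style geography file into (areas, {species: [areas]}).

    Assumed layout (not verified against the files in this repository):
        <n_taxa> <n_areas> (A B C ...)
        <species_name> <string of 0/1, one character per area>
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]

    counts, _, area_part = header.partition("(")
    n_taxa, n_areas = (int(x) for x in counts.split())
    areas = area_part.strip().rstrip(")").split()
    if len(areas) != n_areas:
        raise ValueError("header area count does not match the area list")

    ranges = {}
    for row in rows:
        name, bits = row.split()
        if len(bits) != n_areas or set(bits) - {"0", "1"}:
            raise ValueError(f"malformed presence/absence string for {name}")
        # Keep the letters of the areas coded as present (1).
        ranges[name] = [a for a, b in zip(areas, bits) if b == "1"]

    if len(ranges) != n_taxa:
        raise ValueError("header taxon count does not match the number of rows")
    return areas, ranges


# Hypothetical three-species, four-area example.
EXAMPLE = "3\t4 (A B C D)\nsp_one\t1000\nsp_two\t0110\nsp_three\t0001\n"
areas, ranges = parse_geography(EXAMPLE)
print(ranges)
```

Checking the counts in the header against the rows catches the most common editing mistakes, such as a species dropped from the tree but not from the geography file.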
Biogeographic analyses - BioGeoBEARS
Word and symbol meaning:
*: Numbers correspond to clades 1-56.
#: Numbers correspond to timeslices 1-135.
Bioregions: (A) Venezuela, (B) Mesoamerica, (C) Guianas, (D) Amazonia, (E) Antilles, (F) Chaco, (G) Chocó, (H) Paraná, (I) Andes, (J) non-American tropics.
Description of the data and file structure
The folder structure for Biogeographic_analyses_-_BioGeoBEARS.zip is as follows:
Biogeographic analyses - BioGeoBEARS.zip (# from 1-135, * 1-56)
├── LTT
│ └── Count
│   └── CountStatesByTime_all_reps_*.csv
└── null_new
  ├── a_counts_fromto_means_*.txt
  ├── a_counts_fromto_sds_*.txt
  ├── all_dispersals_counts_fromto_means_*.txt
  ├── all_dispersals_counts_fromto_sds_*.txt
  ├── ana_dispersals_counts_fromto_means_*.txt
  ├── ana_dispersals_counts_fromto_sds_*.txt
  ├── BSM_inputs_file_*.Rdata
  ├── d_counts_fromto_means_*.txt
  ├── d_counts_fromto_sds_*.txt
  ├── DEC+J_M0_*__time#_a_counts_fromto_means.txt
  ├── DEC+J_M0_*__time#_a_counts_fromto_sds.txt
  ├── DEC+J_M0_*__time#_all_dispersals_counts_fromto_means.txt
  ├── DEC+J_M0_*__time#_all_dispersals_counts_fromto_sds.txt
  ├── DEC+J_M0_*__time#_ana_dispersals_counts_fromto_means.txt
  ├── DEC+J_M0_*__time#_ana_dispersals_counts_fromto_sds.txt
  ├── DEC+J_M0_*__time#_d_counts_fromto_means.txt
  ├── DEC+J_M0_*__time#_d_counts_fromto_sds.txt
  ├── DEC+J_M0_*__time#_e_counts_rectangle.txt
  ├── DEC+J_M0_*__time#_founder_counts_fromto_means.txt
  ├── DEC+J_M0_*__time#_founder_counts_fromto_sds.txt
  ├── DEC+J_M0_*__time#_summary_counts_BSMs.txt
  ├── DEC+J_M0_*__time#_unique_sub_counts.txt
  ├── DEC+J_M0_*__time#_unique_vic_counts.txt
  ├── Fern_DEC+J_M0_v1_*.txt
  ├── Ferns_DEC+J_M0_v1_*.Rdata
  ├── filtered_prepBioGeoBEARS_*.txt
  ├── filtered_tree_*.tre
  ├── founder_counts_fromto_means_*.txt
  ├── founder_counts_fromto_sds_*.txt
  ├── RES_ana_events_tables_*.Rdata
  ├── RES_ana_events_tables_PARTIAL.Rdata
  ├── RES_ana_events_tables.Rdata
  ├── RES_clado_events_tables_*.Rdata
  ├── summary_counts_BSMs_*.txt
  ├── unique_sub_counts_*.txt
  ├── unique_sym_counts_*.txt
  └── unique_vic_counts_*.txt
Biogeographic analyses - BioGeoBEARS
- Folder null_new contains the output from script 06_BioGeoBEARS_analysis_DEC+J_null_TS_BSM_array.R.
  - .txt: These files are standard BioGeoBEARS BSM output files produced for each clade (1-56).
    - 'DEC+J_M0_*__time#' in the title: These files are BioGeoBEARS output files that give the number of events per clade and, in addition, per timeslice in 1 My increments.
    - 'fromto' in the title: Rows indicate source bioregions and columns indicate sink bioregions. Numbers indicate the modelled number of incidents of the given biogeographical events based on the Dispersal, Extinction, Cladogenesis model including jump dispersal (DEC+J).
      - 'means' in the title indicates the average number of events taken across 100 stochastic maps.
      - 'sds' in the title indicates the standard deviations.
      - Meaning of abbreviations used in the file names:
        - a: Range-switching dispersal (all observed 'a' dispersals).
        - d: Range-expansion dispersal (all observed 'd' dispersals).
        - ana_dispersals: Anagenetic dispersal (mean of all observed anagenetic 'a' or 'd' dispersals).
        - founder: Cladogenetic dispersal (mean of all observed jump 'j' dispersals).
        - all_dispersals: ALL dispersal (mean of all observed anagenetic 'a', 'd' dispersals, including cladogenetic founder/jump dispersal).
    - 'unique_@_counts' in the title indicates cladogenetic events where rows are the stochastic maps from 1-100 and columns are the bioregion or combination of bioregions involved in the event.
      - Abbreviations (@):
        - sub: Subset sympatry. Arrow indicates that a range goes from one state to another, e.g., ABC --> A, ABC
        - sym: Sympatry.
        - vic: Vicariance. Arrow indicates that a range goes from one state to another, e.g., ABC --> A, BC
    - 'summary_counts_BSMs' in the title indicates files that give the summary of event counts from the 100 stochastic maps.
  - _*.Rdata: These files are R data files containing data from different steps of running the BioGeoBEARS script. This includes both input and result files for each clade.
  - filtered_tree_*.tre: These files are the input tree files for each of the clades used to run the script.
  - filtered_prepBioGeoBEARS_*.txt: These files contain the species bioregion classification with presence (1) or absence (0) corresponding to the trees in filtered_tree_*.tre, in phylip format.
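To make the 'fromto' convention concrete (rows are sources, columns are sinks), the sketch below reads a small matrix and computes, for each bioregion, the events it emits minus the events it receives. This is an illustrative Python sketch rather than part of the authors' R workflow. It assumes a whitespace-delimited table with a header row of sink letters and the source letter at the start of each row; the three-region matrix and its values are invented for the example.

```python
def read_fromto(text):
    """Read a source-by-sink table into {source: {sink: value}}.

    Assumed layout: first line lists the sink regions; every following
    line starts with the source region, then one value per sink.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    sinks = lines[0]
    matrix = {}
    for fields in lines[1:]:
        source, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(sinks):
            raise ValueError(f"row {source} does not match the header")
        matrix[source] = dict(zip(sinks, values))
    return matrix


def net_export(matrix):
    """Row sum (events leaving a region) minus column sum (events arriving)."""
    regions = list(matrix)
    result = {}
    for region in regions:
        emitted = sum(matrix[region].values())
        received = sum(matrix[source].get(region, 0.0) for source in regions)
        result[region] = emitted - received
    return result


# Hypothetical mean dispersal counts among three bioregions (made-up values).
EXAMPLE = """\
A B I
A 0 1.5 0.5
B 0.25 0 0.25
I 3.0 2.0 0
"""
balance = net_export(read_fromto(EXAMPLE))
print(balance)
```

A positive balance marks a region that acts as a net source in the toy matrix and a negative balance a net sink; the balances always sum to zero because every event has one source and one sink.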
- Folder LTT contains the lineage through time (LTT) output from script BioGeoBEARS_posterior_trees_count_events_v37_SLURM.R that has been used in our analyses.
  - _*.csv files in folder Count: These files contain the number of lineages present in a given combination of bioregions at different time points, based on 100 repetitions of biogeographic stochastic mapping, with the following columns in the following order:
    - TimePoint: Timeslices in 1 My increments, e.g., 0 = 0-1 My.
    - BSM: Biogeographic stochastic mapping repetition. 1-100.
    - A, B, C etc.: Bioregions.
    - SUM_A etc.: Total number of lineages present in the given bioregion, in any combination of bioregions.
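The relation between the range columns and the SUM columns can be sketched as follows: the total for a bioregion is the sum of the lineage counts over every range that includes it, and the per-timeslice value is then averaged over the BSM repetitions. This Python sketch is an assumption-laden illustration, not the authors' R code; it assumes that range columns are named by their region letters (for example A, I, AI) and that TimePoint holds integers. The two-region CSV content is invented.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

REGIONS = "ABCDEFGHIJ"  # bioregion letters used in this README


def range_columns(fieldnames, regions=REGIONS):
    """Columns whose names consist only of region letters, e.g. 'A' or 'AI'."""
    return [f for f in fieldnames if f and set(f) <= set(regions)]


def mean_lineages_per_region(csv_text, regions=REGIONS):
    """Return {time point: {region: mean lineage count across BSM reps}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ranges = range_columns(reader.fieldnames, regions)
    per_rep = defaultdict(lambda: defaultdict(list))
    for row in reader:
        time_point = int(row["TimePoint"])
        for region in regions:
            # Lineages in any range that includes this region.
            total = sum(float(row[r]) for r in ranges if region in r)
            per_rep[time_point][region].append(total)
    return {
        t: {region: mean(values) for region, values in by_region.items()}
        for t, by_region in per_rep.items()
    }


# Hypothetical two-region file with two BSM repetitions per time point.
EXAMPLE = """\
TimePoint,BSM,A,I,AI
0,1,2,5,1
0,2,4,5,3
1,1,1,3,0
1,2,1,5,2
"""
summary = mean_lineages_per_region(EXAMPLE, regions="AI")
print(summary)
```

Filtering on column names made up only of region letters keeps the TimePoint, BSM, and SUM columns out of the totals.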
Figures
Description of the data and file structure
Figures.zip
(# from 1-135, * 1-56)
├── DEC+J_M0_*__time#_histograms_of_event_counts.pdf
├── DEC+J_M0_histograms_of_event_counts_*.pdf
├── DEC+J_M0_ML_vs_BSM.pdf
├── DEC+J_M0_single_stochastic_map_m0_*.pdf
└── DEC+J_M0_100BSMs_v1_*.pdf
This folder contains output PDF files from running the script 06_BioGeoBEARS_analysis_DEC+J_null_TS_BSM_array.R, which runs a BioGeoBEARS analysis using the DEC + J biogeographical model.
- DEC+J_M0_*__time#_histograms_of_event_counts.pdf: These files contain a histogram showing biogeographic event counts in each of 100 BSMs for each clade and each timeslice. The events are founder events, vicariance, subset sympatry, narrow sympatry, and anagenetic dispersal.
- DEC+J_M0_histograms_of_event_counts_*.pdf: These files contain a histogram showing biogeographic event counts in each of 100 BSMs for each clade across all timeslices. The events are founder events, vicariance, subset sympatry, narrow sympatry, and anagenetic dispersal.
- DEC+J_M0_ML_vs_BSM.pdf: These files contain a comparison between the Maximum Likelihood (ML) ancestral range estimates and those from Biogeographic Stochastic Mapping (BSM).
- DEC+J_M0_single_stochastic_map_m0_*.pdf: These files contain a visualisation of a single stochastic map showing the inferred biogeographic events along branches of the phylogenetic tree for each clade.
- DEC+J_M0_100BSMs_v1_*.pdf: These files contain results from 100 Biogeographic Stochastic Maps (BSMs) for each clade.
References
To conduct the analyses, we have depended on the following packages and scripts:
BioGeoBEARS:
Matzke, Nicholas J. (2018). BioGeoBEARS: BioGeography with Bayesian (and likelihood) Evolutionary Analysis with R Scripts. version 1.1.1, published on GitHub on November 6, 2018. DOI: http://dx.doi.org/10.5281/zenodo.1478250
BioGeoBEARS - BSM:
Matzke, N. J. 2016. Stochastic mapping under biogeographical models. - PhyloWiki BioGeoBEARS website.
BioGeoBEARS_posterior_trees_count_events_v37_SLURM.R:
Magalhaes, I. L. F., Santos, A. J. and Ramírez, M. J. 2021. Incorporating Topological and Age Uncertainty into Event-Based Biogeography of Sand Spiders Supports Paleo-Islands in Galapagos and Ancient Connections among Neotropical Dry Forests. - Diversity 13: 418.
CoordinateCleaner:
Zizka, A., Silvestro, D., Andermann, T., Azevedo, J., Duarte Ritter, C., Edler, D., Farooq, H., Herdean, A., Ariza, M., Scharn, R., Svantesson, S., Wengström, N., Zizka, V. and Antonelli, A. 2019. CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases. - Methods Ecol. Evol. 10: 744–751.
ftolr:
FTOL working group 2024. ftolr: Data for the Fern Tree of Life (FTOL). Zenodo. doi: 10.5281/zenodo.14010953.
GBIF:
GBIF.org (21 May 2024). GBIF Occurrence Download. https://doi.org/10.15468/dl.k6w5vk.
Morrone shapefile:
Morrone, J. J., Escalante, T., Rodríguez-Tapia, G., Carmona, A., Arana, M. and Mercado-Gómez, J. D. 2022. Biogeographic regionalization of the Neotropical region: New map and shapefile. - An. Acad. Bras. Ciênc. 94: e20211167.
Phylogenetic tree:
Nitta JH, Schuettpelz E, Ramírez-Barahona S and Iwasaki W (2022). An open and continuously updated fern tree of life. Front. Plant Sci. 13:909768. doi: 10.3389/fpls.2022.909768
pteridocat:
FTOL Working Group 2022. pteridocat: A taxonomic database of pteridophytes. https://doi.org/10.5281/zenodo.7583310.
R:
R Core Team 2024. R: A Language and Environment for Statistical Computing. - R Foundation for Statistical Computing. https://www.R-project.org/.
SpeciesGeoCoder:
Töpel, M., Zizka, A., Calió, M. F., Scharn, R., Silvestro, D. and Antonelli, A. 2017. SpeciesGeoCoder: Fast Categorization of Species Occurrences for Analyses of Biodiversity, Biogeography, Ecology, and Evolution. - Syst. Biol. 66: 145–151.
taxastand:
Nitta, J. H. 2022. “joelnitta/taxastand: v1.0.0”. Zenodo. doi: 10.5281/zenodo.6692810.
World Ferns:
Hassler, M. 2025. World Ferns. Synonymic Checklist and Distribution of Ferns and Lycophytes of the World. Version 25.04.
