README file for Dryad repository, created on June 13, 2018

Explaining the ocean’s dominant species richness gradient and global patterns of fish diversity

Authors: Elizabeth Christina Miller, Kenji T. Hayashi, Dongyuan Song, and John J. Wiens
Corresponding author: EC Miller (ecmiller 'at' email.arizona.edu or lizmiller2633@gmail.com)
Please contact the corresponding author with questions or requests for additional material.

R scripts are ordered following the Material and Methods section. Input files are included.

a) Phylogeny and biogeographic coding

Script 1 downloads and cleans OBIS and GBIF data, and plots occurrences on QGIS shape files to code presence/absence.
Also included are the QGIS shape files and a list of 17,458 species names taken from FishBase in August 2016 ("perc_complete_list_aug11_2016.csv").
Note that much of the work described in this section was done manually as described in Appendix S1, including compiling the final datasets.
The final datasets (georeferenced, FishBase, and IUCN) are found in the ESM (Dataset S1). They are also included here as 3 separate .csv files.

b) Ancestral range estimation using BioGeoBEARS

Script 2 fits the DEC model and simulates 100 independent histories ("stochastic maps") using BioGeoBEARS, then generates an output file for each stochastic map containing colonization events and their ages.

All associated files for running BioGeoBEARS are included in folder "biogeobears_input_files":
1. "perc_tree_edited.tree" is the modified phylogeny (see Appendix S1)
2. "fish_areas_phylip.txt" contains the range codes for species in the phylogeny, based on the georeferenced data
3. "fish_time_periods.txt" contains the time bins associated with the dispersal matrices
4. "fish_dispersal_matrix.txt" contains the series of dispersal matrices associated with the time bins

There is also a sub-directory containing modified files to impose fossil-based constraints:
1. "fish_areasallowed_matrix.txt" is needed to only allow colonization of the CIP at 34 myr and after
2. "fish_time_periods_fossils.txt" is needed to denote the change at 34 myr
3. "fish_dispersal_matrix_fossils.txt" is needed to denote the change at 34 myr
4. "bgb_fossil_constraints.txt" contains the text needed to impose fossil constraints on nodes

NOTE that areas are coded as single letters for BioGeoBEARS compatibility, instead of the abbreviations used in Fig. 1. These single-letter codes are also used in downstream scripts. The codes are:
A = WA; B = EA; C = WIP; D = CIP; E = CP; F = EP; G = NC; H = SC; I = FW

The output files generated by Script 2 and used by Miller et al. are included (output may vary between runs of the script, because the stochastic maps are independent simulations). They are in the directory c) colonization analyses.

c) Colonization analyses

Input files are stochastic maps (sub-directory "stochastic_maps"). For reconstructions with and without fossil constraints, these are:
1. "stochastic_maps_all_states": one file for each of 100 stochastic maps. The numbers in column 1 identify each tip (1-4571) and node (4572-9141). The letters in column 2 are the reconstructed states (at nodes).
2. "stochastic_maps_independent_events": one file for each of 100 stochastic maps. Only independent colonization events to a new region are listed. For each region (e.g. region A), the column A is the number of the node or tip, and the column A_ages is the age of the colonization event.

Script 3 performs 100 regressions (one for each stochastic map) of regional richness on three metrics of colonization history (Fig. 1a-b).
Script 4 performs regression analyses within the period 34-5.3 Ma (Fig. 1c-d).
Script 5 performs regression analyses within the period 5.3-0 Ma (Fig. 1e-f).
Script 6 is an example of the 5-myr time bin visualization (Fig. 2). For one area, this script gets the proportion of descendants and the number of colonizations for each 5-myr time bin. Scripts for the other eight areas are identical unless noted, and are excluded for brevity.
Note that to create Figure 2, we took the average values across the 100 stochastic maps for each 5-myr time bin. This can be done using the output files created by Script 6.

d) Net diversification rates

Script 7 calculates net diversification rates and the weighted mean net rate for each region (using families), and performs regression analyses (Fig. 3a).
Input file "net_div_regional_rich.csv" is included, and contains the richness and the crown and stem ages of each family.

e) HiSSE

Script 8 creates input data files appropriate for binary comparisons, and performs HiSSE model comparison (Fig. 3b). We also include code for performing model-averaging and separating tips by hidden states.
Input file "fish_areas_final_sep1.csv" is included for convenience (this is the same file as in subdirectory a) phylogeny and biogeographic coding).
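
To illustrate what an input "appropriate for binary comparisons" could look like, the R sketch below recodes multi-region range codes into a 0/1 state for one focal region. It is an assumption-laden example, not the code in Script 8: the ranges data frame is invented, it assumes ranges are stored as strings of the single-letter area codes listed in section b, and the helper code_binary and the choice of D (CIP) as the focal region are hypothetical. The actual layout of "fish_areas_final_sep1.csv" and the comparisons made by Miller et al. may differ.

```r
# Invented example data; each letter is one region (e.g. D = CIP, I = FW).
ranges <- data.frame(
  species = c("sp1", "sp2", "sp3", "sp4"),
  areas   = c("D", "CD", "I", "AB"),
  stringsAsFactors = FALSE
)

# Assumed helper: 1 if the species occurs in the focal region, 0 otherwise.
code_binary <- function(areas, focal) {
  as.integer(grepl(focal, areas, fixed = TRUE))
}

# Two-column table (species name, binary state) for a focal-region comparison
binary_input <- data.frame(
  species = ranges$species,
  state   = code_binary(ranges$areas, focal = "D"),
  stringsAsFactors = FALSE
)
print(binary_input)
```

Repeating the call with a different focal letter would give the table for another region-versus-rest comparison.
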