Analysis of biodiversity data suggests that mammal species are hidden in predictable places
Data files
Mar 23, 2022 version files (105.16 MB total)
- genetic_data.zip (5.41 MB)
- geographic_data.zip (99.73 MB)
- README.txt (2.67 KB)
- supplemental_var.xlsx (18.08 KB)
Abstract
Research in the biological sciences is hampered by the Linnean shortfall, which describes the number of hidden species that are suspected of existing without formal species description. Using machine learning and species delimitation methods, we built a predictive model that incorporates some 5.0 × 10^5 data points for 117 species traits, 3.3 × 10^6 occurrence records, and 9.1 × 10^5 gene sequences from 4,310 recognized species of mammals. Delimitation results suggest that there are hundreds of undescribed species in class Mammalia. Predictive modeling indicates that most of these hidden species will be found in small-bodied taxa with large ranges characterized by high variability in temperature and precipitation. As demonstrated by a quantitative analysis of the literature, such taxa have long been the focus of taxonomic research. This analysis supports taxonomic hypotheses regarding where undescribed diversity is likely to be found and highlights the need for investment in taxonomic research to overcome the Linnean shortfall.
Genetic data: We downloaded all available mammalian DNA sequences for the mitochondrial genes cytochrome-c oxidase I (COI) and cytochrome-b (cytb) from the NIH genetic sequence database, GenBank. For each gene, we grouped sequences by species and then manually checked all species records for errors (e.g., subspecies, duplicates, and extinct species). To ensure standardization across groups, we updated all sequence taxonomy to reflect that of the Mammal Diversity Database (MDD) published by the American Society of Mammalogists. Following taxonomic standardization, we grouped sequence records for each gene by family and generated multiple sequence alignments for COI and cytb independently using MUSCLE v3.5. We then visually inspected each family-level alignment for gaps and removed problematic sequences that caused severe gaps or misalignment and could not be resolved by reverse complementing or manual correction.
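As an illustration of the two checks mentioned above, the following is a minimal Python sketch of reverse complementing a sequence and measuring how much of an aligned sequence consists of gaps. It is not the procedure used to build this dataset, where inspection was visual and manual. The function names and the use of "-" as the gap character are assumptions made for the example, and no gap threshold is implied.

```python
# Complement table covering standard bases and IUPAC ambiguity codes,
# in upper and lower case. Characters not listed (e.g. "-") pass through unchanged.
_BASES = "ACGTRYSWKMBDHVN"
_COMPS = "TGCAYRSWMKVHDBN"
_COMPLEMENT = str.maketrans(_BASES + _BASES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def gap_fraction(aligned_seq, gap_char="-"):
    """Return the fraction of alignment columns that are gaps in one sequence.

    ASSUMPTION: gaps are written as "-" in the alignment.
    """
    if not aligned_seq:
        return 0.0
    return aligned_seq.count(gap_char) / len(aligned_seq)
```

A sequence that aligns poorly in its original orientation can be passed through reverse_complement and realigned; gap_fraction gives a rough number for comparing the two results.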
Geographic data: We first downloaded all geographic coordinates for class Mammalia from the Global Biodiversity Information Facility (GBIF) and used these to extract data from several GIS layers, including elevation, the 19 BIOCLIM layers at 1-km resolution pertaining to temperature and precipitation available from the WorldClim database, population density, gross domestic product, light pollution, protected areas, anthropogenic biomes, and GlobCover by the European Space Agency. We then curated these occurrence records using the R package CoordinateCleaner.
Genetic data: Sequence alignments are organized taxonomically in folders corresponding to orders and families. Within each family folder there are alignment files titled "COI_unique.fas" and "cytb_unique.fas" that contain family-level alignments for COI and cytb, respectively. Certain groups may be missing one or both files, indicating that there was not enough sequence data to generate an alignment for that particular group.
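The folder layout described above can be traversed with a short Python sketch such as the one below, which lists the alignment files present in each family folder and reads one into memory. It assumes the unzipped archive has the shape root/order/family/ and that the ".fas" files are in FASTA format; neither assumption is verified here, and the function names are hypothetical.

```python
from pathlib import Path

# File names as given in the description of the genetic data.
ALIGNMENT_NAMES = ("COI_unique.fas", "cytb_unique.fas")


def find_alignments(root):
    """Map (order, family) to the alignment files that exist in that family folder.

    ASSUMPTION: the directory layout is root/<order>/<family>/.
    Families missing one or both alignments get a smaller (or empty) dict.
    """
    found = {}
    root = Path(root)
    for order_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for family_dir in sorted(p for p in order_dir.iterdir() if p.is_dir()):
            present = {
                name: family_dir / name
                for name in ALIGNMENT_NAMES
                if (family_dir / name).is_file()
            }
            found[(order_dir.name, family_dir.name)] = present
    return found


def read_fasta(path):
    """Parse a FASTA file into a dict of header -> sequence.

    ASSUMPTION: records start with ">" header lines. If a header repeats,
    the later record replaces the earlier one.
    """
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}
```

Because missing files are reported as absent keys and not as errors, the same loop can be used to count how many families lack a COI or cytb alignment.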
Geographic data: Occurrence records are organized taxonomically in folders corresponding to orders and families. Within each family folder there are species files containing the geographic data for each species. Files titled "genus_species.csv" contain a list of all coordinates gathered for that species. Files titled "genus_species_clean.txt" contain the results of CoordinateCleaner curation, with each row after the species name representing a specific flag; a value of "FALSE" means the occurrence was flagged and discarded, and a value of "TRUE" means the occurrence was kept. Files titled "genus_species_var.txt" contain values extracted from GIS layers using the species' curated occurrence records. Please see the "README.txt" and "supplemental_var.xlsx" files for more detailed information on the GIS layers used and descriptions of the values extracted. Each family folder also contains a "family_var.txt" file, which contains values from all of the "genus_species_var.txt" files located within the family folder. Similarly, each order folder contains an "order_var.txt" file, which contains values from all of the "family_var.txt" files located within the order folder. Certain groups may be missing one or more files, indicating that there was not enough occurrence data to generate occurrence files for that particular group.
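The TRUE/FALSE convention in the "genus_species_clean.txt" files can be illustrated with a minimal Python sketch that treats an occurrence as kept only when every flag row marks it "TRUE". The exact file format is not specified here, so the parser assumes whitespace-delimited rows in which the first line is the species name and the first token of each later row is the flag name; these are assumptions, and the function names are hypothetical. Check the files and "README.txt" before relying on it.

```python
def parse_flag_rows(lines):
    """Return {flag_name: [bool, ...]} from rows like 'flag_name TRUE FALSE ...'.

    ASSUMPTIONS: the first line holds the species name and is skipped;
    each later row is whitespace-delimited with the flag name first.
    """
    flags = {}
    for line in lines[1:]:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed rows
        name, values = tokens[0], tokens[1:]
        # Strip optional quotes; anything other than TRUE counts as flagged.
        flags[name] = [v.strip('"').upper() == "TRUE" for v in values]
    return flags


def kept_mask(flags):
    """Return one bool per occurrence: True only if no flag row is FALSE."""
    return [all(column) for column in zip(*flags.values())]
```

For example, with two hypothetical flag rows "flag_a TRUE TRUE FALSE" and "flag_b TRUE FALSE TRUE", only the first occurrence passes both tests and would be kept.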