# README

### WCVP, GenBank, and BIEN download data for the phylogenetic knowledge manuscript "The Darwinian shortfall in plants: phylogenetic knowledge is driven by range size"

Authors: Alexander V Rudbeck, Miao Sun, Melanie Tietje, Rachael V. Gallagher, Rafaël Govaerts, Stephen A. Smith, Jens-Christian Svenning, Wolf L. Eiserhardt

### DATA ###

checklist_names.txt
> WCVP taxonomic reference data
> downloaded July 2021. Non-public data; access via link provided by Rafaël Govaerts
> description of relevant variables:
- plant_name_id: unique ID for all entries
- taxon_rank: describes whether the entry represents a species, genus, variety, etc.
- taxon_status: describes whether the entry represents an accepted species or a synonym
- taxon_name: name of the entered taxon
- accepted_plant_name_id: plant name ID of the accepted taxon that the entry represents. If the entry is itself an accepted taxon, it is identical to plant_name_id.
> The dataset includes empty cells where no values are applicable. There are no empty cells in the columns relevant to this analysis.

checklist_distribution.txt
> WCVP distribution data
> downloaded July 2021. Non-public data; access via link provided by Rafaël Govaerts
> description of relevant variables:
- plant_name_id: unique ID for all entries
- area_code_l3: the three-letter code that is unique to each of the botanical countries used in the analysis

fern_list.txt
> List of fern families to exclude from WCVP data

bryophyta_list.txt
> List of moss families to exclude from WCVP data

level3.shp
level3.shp.xml
level3.sbn
level3.dbf
level3.sbx
level3.shx
> Shapefile for botanical countries
> downloaded via https://github.com/tdwg/wgsrpd

gb_280321(4).txt
> subset of the plant division of the NCBI GenBank database, downloaded March 2021, which includes only entries whose description includes one of the 128 most relevant markers for plant phylogenetics.
> SQLite search query:

    SELECT description, name
    FROM sequence
    LEFT JOIN taxonomy ON taxonomy.ncbi_id = sequence.ncbi_id
    WHERE name_class = "scientific name"
    AND description LIKE '%accD%'
    OR description LIKE '%atp1%' OR description LIKE '%atp9%'
    OR description LIKE '%atp synthase subunit 9%'
    OR description LIKE '%atpA%' OR description LIKE '%atp1%'
    OR description LIKE '%atpase alpha subunit%'
    OR description LIKE '%atpB%' OR description LIKE '%atpE%' OR description LIKE '%atpF%'
    OR description LIKE '%atpH%' OR description LIKE '%atpI%' OR description LIKE '%ccsA%'
    OR description LIKE '%cemA%' OR description LIKE '%clpP%' OR description LIKE '%cob%'
    OR description LIKE '%cox1%' OR description LIKE '%cox2%' OR description LIKE '%coxii%'
    OR description LIKE '%cytochrome oxidase subunit ii%'
    OR description LIKE '%cytochrome oxidase subunit 2%'
    OR description LIKE '%ets%' OR description LIKE '%external transcribed spacer%'
    OR description LIKE '%infA%' OR description LIKE '%its2%'
    OR description LIKE '%internal transcribed spacer%'
    OR description LIKE '%5.8S%' OR description LIKE '%its%' OR description LIKE '%23S%'
    OR description LIKE '%5S%' OR description LIKE '%4.5S%' OR description LIKE '%matK%'
    OR description LIKE '%matR%' OR description LIKE '%nad2%' OR description LIKE '%nad5%'
    OR description LIKE '%ndhA%' OR description LIKE '%ndhB%' OR description LIKE '%ndhC%'
    OR description LIKE '%ndhD%' OR description LIKE '%ndhE%' OR description LIKE '%ndhF%'
    OR description LIKE '%ndhG%' OR description LIKE '%ndhH%' OR description LIKE '%ndhI%'
    OR description LIKE '%ndhJ%' OR description LIKE '%ndhK%' OR description LIKE '%petA%'
    OR description LIKE '%petB%' OR description LIKE '%petD%' OR description LIKE '%petG%'
    OR description LIKE '%petL%' OR description LIKE '%petN%' OR description LIKE '%phyA%'
    OR description LIKE '%phyB%' OR description LIKE '%phytochrome A%'
    OR description LIKE '%phytochrome B%'
    OR description LIKE '%psaA%' OR description LIKE '%psaB%' OR description LIKE '%psaC%'
    OR description LIKE '%psaI%' OR description LIKE '%psaJ%' OR description LIKE '%psbA%'
    OR description LIKE '%trnH%' OR description LIKE '%psbB%' OR description LIKE '%psbC%'
    OR description LIKE '%psbD%' OR description LIKE '%psbE%' OR description LIKE '%psbF%'
    OR description LIKE '%psbH%' OR description LIKE '%psbI%' OR description LIKE '%psbJ%'
    OR description LIKE '%psbK%' OR description LIKE '%psbL%' OR description LIKE '%psbM%'
    OR description LIKE '%psbN%' OR description LIKE '%psbT%' OR description LIKE '%psbZ%'
    OR description LIKE '%rbcL%' OR description LIKE '%rpl14%' OR description LIKE '%rpl16%'
    OR description LIKE '%rpl2%' OR description LIKE '%rpl20%' OR description LIKE '%rpl22%'
    OR description LIKE '%rpl23%' OR description LIKE '%rpl32%' OR description LIKE '%rpl33%'
    OR description LIKE '%rpl36%' OR description LIKE '%rpoA%' OR description LIKE '%rpoB%'
    OR description LIKE '%rpoC1%' OR description LIKE '%rpoC2%' OR description LIKE '%rps11%'
    OR description LIKE '%rps12%' OR description LIKE '%rps14%' OR description LIKE '%rps15%'
    OR description LIKE '%rps16%' OR description LIKE '%rps18%' OR description LIKE '%rps19%'
    OR description LIKE '%rps2%' OR description LIKE '%rps3%' OR description LIKE '%rps4%'
    OR description LIKE '%rps7%' OR description LIKE '%rps8%' OR description LIKE '%rrn16%'
    OR description LIKE '%rrn23%' OR description LIKE '%rrn4.5%' OR description LIKE '%rrn5%'
    OR description LIKE '%trnC%' OR description LIKE '%petN%' OR description LIKE '%ycf6%'
    OR description LIKE '%trnF%' OR description LIKE '%trnG%' OR description LIKE '%trnK%'
    OR description LIKE '%trnQ%' OR description LIKE '%tRNA-Ser%' OR description LIKE '%trnS%'
    OR description LIKE '%trnS%' OR description LIKE '%trnT%' OR description LIKE '%trnF%'
    OR description LIKE '%trnV%' OR description LIKE '%trnL%' OR description LIKE '%trnY%'
    OR description LIKE '%ycf1%' OR description LIKE '%ycf2%' OR description LIKE '%ycf3%'

genbank_entries_2022.csv
> table of GenBank plant entries with relevant phylogenetic data as of March 2021
> table produced by the "02_genbank_download.R" script
> as the script runtime exceeds 3 hours, the result is provided here directly to shorten the time needed to reproduce the analysis

genbank_entries_w_duplicates_2022.csv
> table of GenBank plant entries with relevant phylogenetic data, with all markers preserved, as of March 2021
> table produced by the "02_genbank_download.R" script
> as the script runtime exceeds 3 hours, the result is provided here directly to shorten the time needed to reproduce the analysis

BIEN_WCVP_merged_Sep21_no_centroids.rds
> BIEN download from 23 April 2020
> including WCVP accepted taxon ID (using our taxonomy matcher as described here: https://doi.org/10.5281/zenodo.5656763)
> download settings:

    library(BIEN)

    # Query the BIEN database through the package's internal SQL helper
    all_bien_occurrences <- BIEN:::.BIEN_sql("SELECT scrubbed_family, scrubbed_species_binomial,
      scrubbed_taxon_name_no_author, scrubbed_author, taxonobservation_id,
      latitude, longitude, is_centroid
      FROM view_full_occurrence_individual
      WHERE is_geovalid = 1
      AND observation_type IN ('plot', 'plot occurrence','specimen', 'literature','checklist', 'checklist occurrence')
      AND higher_plant_group NOT IN ('Algae','Bacteria','Fungi')
      AND is_cultivated_observation = 0;")

    # Number of occurrence records returned
    print(dim(all_bien_occurrences)[1])

> excluding cases with incomplete taxonobservation_id, lat, lng.
> Cleaned for centroids.

BIEN_in_WCVP_regions_Sept21.RData
> BIEN occurrence results:
>> res: data frame with species richness counts per botanical country
>> spec.list: accepted taxon IDs for BIEN occurrences for each botanical country

L3_and_area.csv
> Level 3 botanical countries and their area sizes

socioeco_var.csv
> socioeconomic variables for botanical countries
> description of variables:
- LEVEL_3_CO: the three-letter area code that is unique to each of the botanical countries used in this analysis
- AREA_SQKM: area of the botanical country in km^2
- GDP_SUM: the total Gross Domestic Product of the botanical country
- GDP_CAPITA: the Gross Domestic Product per capita
- ROAD_DENSITY: the mean road density, in meters of road per km^2 of area
- POP_COUNT: total population in the area
- POP_DENSITY: population per km^2
- SECURITY: measured with the Global Peace Index (GPI), which describes the level of peace in a nation. Low values indicate a high level of peace
- RESEARCH_EXP: % of Gross Domestic Product invested in research and development
- EDUCATION_EXP: % of Gross Domestic Product invested in education

# The scripts to process this data are available at https://github.com/pebgroup/Distribution_of_phylogenetic_data #
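
### Note on the SQLite search query ###

> The sketch below is an illustration only, not part of the original pipeline. It uses Python's built-in sqlite3 module and a toy in-memory database to show how SQLite evaluates a WHERE clause shaped like the search query listed for gb_280321(4).txt. The table and column names (sequence, taxonomy, ncbi_id, description, name, name_class) come from that query; every row is hypothetical, and the query is shortened to two markers. In SQLite, AND binds more tightly than OR, so without parentheses the name_class condition constrains only the first LIKE term. Whether this matters for the real download depends on the contents of the taxonomy table, which is not described here, and the grouped variant is an assumption about intent.

```python
import sqlite3

# Toy stand-ins for the `sequence` and `taxonomy` tables named in the query.
# All rows are hypothetical and exist only to make the example runnable.
SCHEMA_AND_ROWS = """
CREATE TABLE sequence (ncbi_id INTEGER, description TEXT);
CREATE TABLE taxonomy (ncbi_id INTEGER, name TEXT, name_class TEXT);
INSERT INTO sequence VALUES
    (1, 'Quercus robur rbcL gene, partial cds'),
    (1, 'Quercus robur matK gene, partial cds'),
    (2, 'Fagus sylvatica 18S ribosomal RNA gene');
INSERT INTO taxonomy VALUES
    (1, 'Quercus robur', 'scientific name'),
    (1, 'English oak', 'common name'),
    (2, 'Fagus sylvatica', 'scientific name');
"""

SELECT_JOIN = """
SELECT description, name FROM sequence
LEFT JOIN taxonomy ON taxonomy.ncbi_id = sequence.ncbi_id
"""

# Same shape as the README query, shortened to two markers.
UNGROUPED = SELECT_JOIN + """
WHERE name_class = 'scientific name'
  AND description LIKE '%rbcL%'
   OR description LIKE '%matK%'
"""

# Variant with the marker terms grouped (an assumption about intent).
GROUPED = SELECT_JOIN + """
WHERE name_class = 'scientific name'
  AND (description LIKE '%rbcL%' OR description LIKE '%matK%')
"""


def run_queries():
    """Run both query variants against the toy database and return their rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA_AND_ROWS)
    ungrouped = con.execute(UNGROUPED).fetchall()
    grouped = con.execute(GROUPED).fetchall()
    con.close()
    return ungrouped, grouped


ungrouped_rows, grouped_rows = run_queries()
print(len(ungrouped_rows), "rows without parentheses")
print(len(grouped_rows), "rows with parentheses")
```

> With these toy rows, the ungrouped form also returns the matK sequence paired with the 'common name' entry, while the grouped form returns only rows paired with a 'scientific name'.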