Data for: Five decades of biogeography: A view from the Journal of Biogeography
Data files
Dec 26, 2022 version files (166.51 MB):
- README.md
- ref_metadata_with_topics.rds
- wos_results_biog.rds
Abstract
Since the first issue of Journal of Biogeography (JBI) was published in 1974, the discipline and its eponymous journal have grown in scope and consequence. On this 50th anniversary of the journal, we reflect on changes in biogeography and publishing, describe trends of the past five decades, and present lists of the 50 most-cited articles from JBI’s back catalogue. We describe current initiatives intended to chart a course for continued success in the coming 50 years, during what may well be a period of global biogeographic crises.
Methods
Literature search
We searched the peer-reviewed literature associated with biogeography research and indexed in Web of Science’s Core Collection using the functions query_wos and pull_wos from the ‘wosr’ package (Baker, 2018) for R statistical software (R Core Team, 2020). The search string combined variations of the word biogeography with articles published in journals associated with biogeography research, as follows:
- "TS=biogeograph* OR SO=(Journal of Biogeography OR Diversity and Distributions OR Global Ecology and Biogeography OR Ecography OR Global Ecology and Biogeography Letters)"
The search returned 66,812 references, which we then analysed using topic modelling to identify the main topics addressed in this sample of the literature.
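As a minimal sketch of how the search string above could be assembled in R, the following builds the query from its two clauses. The commented-out ‘wosr’ calls are an assumption about how the query might be submitted; they need valid Web of Science credentials and are not run here.

```r
# Journals named in the search string above.
journals <- c(
  "Journal of Biogeography",
  "Diversity and Distributions",
  "Global Ecology and Biogeography",
  "Ecography",
  "Global Ecology and Biogeography Letters"
)

# Combine the topic clause (TS) with the source-title clause (SO).
query <- paste0(
  "TS=biogeograph* OR SO=(",
  paste(journals, collapse = " OR "),
  ")"
)
print(query)

# Assumption: with valid credentials, the query could be passed to wosr
# roughly as follows (placeholder username and password, not run here).
# sid  <- wosr::auth("your_username", "your_password")
# wosr::query_wos(query, sid = sid)          # number of matching records
# refs <- wosr::pull_wos(query, sid = sid)   # download the records
```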
Topic modelling
We used latent Dirichlet allocation (LDA) on the combined text of titles and abstracts to assess the dominant themes being discussed in the sampled literature (Blei et al. 2015). LDA is a statistical model that discovers topics in a set of documents, for a number of topics chosen in advance, based on the premise that words used to discuss a particular topic will tend to occur together across documents more frequently than they occur with the rest of the words.
Prior to the topic modelling implementation, we stemmed the words featuring in the titles and abstracts, removed stopwords, punctuation and numbers, converted all words to lowercase, and removed keywords featuring in fewer than 2% or more than 98% of the references. The remaining list of keywords was converted into a document-term matrix used to run the LDA analysis. The number of topics to be discovered in the data needs to be set manually in LDA, so we varied it between 2 and 25. We used the function FindTopicsNumber from package ‘ldatuning’ (Nikita, 2020) to identify the ideal number of topics based on four metrics, proposed by Arun et al. (2010), Deveaud et al. (2014), Juan et al. (2009), and Griffiths & Steyvers (2004). The optimal number of topics is the fewest topics that cover as much as possible of the information in the original text. Based on these metrics and on an exploration of the resulting topics, we found a good balance when the model was trained on 22 topics. It should be noted that while such metrics are a useful guide, the best judgement is often achieved by human interpretation of the resultant topics (Chang et al., 2009).
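The preprocessing steps described above can be sketched in base R as follows. This is an illustration on a hypothetical four-document corpus, not the scripts included in this submission: the stopword list is a short made-up one, and stemming is omitted because it needs an external package.

```r
# Hypothetical toy corpus standing in for the combined titles and abstracts.
docs <- c(
  "Island biogeography of 12 bird species.",
  "Species richness on islands: a biogeography review",
  "Climate change and species range shifts",
  "Phylogeography of island lizard species in 2020"
)
stopwords <- c("of", "on", "a", "and", "in", "the")  # hypothetical short list

# Lowercase, strip punctuation and numbers, split into words, drop stopwords.
# Stemming is omitted here, so "island" and "islands" remain separate terms.
tokenise <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:][:digit:]]+", " ", text)
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stopwords]
}
tokens <- lapply(docs, tokenise)

# Document-term matrix: one row per document, one column per term.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(w) tabulate(match(w, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Keep terms that feature in at least 2% and at most 98% of documents.
doc_freq <- colMeans(dtm > 0)
dtm <- dtm[, doc_freq >= 0.02 & doc_freq <= 0.98, drop = FALSE]
print(dim(dtm))
```

In this toy corpus the term "species" occurs in every document, so the upper threshold removes it; with only four documents the lower threshold removes nothing.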
Once the final model was set, we assigned each reference to the topic with the highest probability of being associated with it. This classification was then used to analyse the set of publications associated with 15 topics that were identified as strongly related to biogeography research and that cover the majority of the sampled literature (> 80% of references) published since 1974. Specifically, we calculated the percentage of references associated with these topics during each decade (1974–1979, 1980–1989, 1990–1999, 2000–2009, 2010–2019, 2020–present) in the set of references associated with (i) Journal of Biogeography, (ii) a set of biogeography-related journals (Journal of Biogeography, Ecography, Global Ecology and Biogeography, and Diversity and Distributions), and (iii) all journals.
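The assignment and per-decade summary described above can be illustrated with a small base R sketch. The probability matrix, topic names and publication years below are hypothetical values chosen for the example, not data from this submission.

```r
# Hypothetical document-topic probabilities: one row per reference,
# one column per topic (each row sums to 1).
topic_probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.3, 0.3, 0.4,
    0.6, 0.3, 0.1,
    0.2, 0.5, 0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("topic_1", "topic_2", "topic_3"))
)
year <- c(1976, 1985, 1985, 2012, 2021)  # hypothetical publication years

# Assign each reference to its most probable topic.
top_topic <- colnames(topic_probs)[max.col(topic_probs, ties.method = "first")]

# Bin publication years into the decades used in the text.
decade <- cut(
  year,
  breaks = c(1974, 1980, 1990, 2000, 2010, 2020, Inf),
  labels = c("1974-1979", "1980-1989", "1990-1999",
             "2000-2009", "2010-2019", "2020-present"),
  right = FALSE
)

# Percentage of references per topic within each decade.
# Decades with no references in this toy example give NaN rows.
counts <- table(decade, top_topic)
pct <- 100 * prop.table(counts, margin = 1)
print(round(pct, 1))
```

Repeating the same tabulation on subsets of the references (one journal, a set of journals, or all journals) would give the three comparisons listed above.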
Usage notes
The data (.rds files) and scripts included in this submission can be opened using R, which is free and open-source software.
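As a minimal sketch, the two .rds files named in the file list could be loaded and inspected with base R as shown below. The helper inspect_rds is a hypothetical name introduced for this example, and the internal structure of the files is not described here, so str() is used to look at whatever object each file stores.

```r
# Hypothetical helper: read an .rds file if it exists and show its structure.
inspect_rds <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(invisible(NULL))
  }
  obj <- readRDS(path)
  str(obj, max.level = 1)  # top-level overview only
  invisible(obj)
}

# File names as given in the file list; run from the folder containing them.
refs <- inspect_rds("ref_metadata_with_topics.rds")
wos  <- inspect_rds("wos_results_biog.rds")
```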