Trends in research approaches and gender in plant ecology dissertations over four decades
Data files
May 28, 2024 version files 60.71 MB
Abstract
Dissertations are a foundational scientific product; they are the formative product through which early-career scientists create and share original knowledge. Methodological approaches used in dissertations vary depending on the research field. In plant ecology, these approaches include observations, experiments (field or controlled-environment), literature reviews, theoretical approaches, or analyses of existing data (including ‘big data’). This dataset was created to examine how the emphasis on each of these categories has changed over time, and whether male and female authors differ in the methods employed. The dissertations used for this study were gathered from the Proquest Dissertations and Theses Global (https://www.proquest.com/pqdtglobal) database.
README: Trends in methodological approaches and gender in plant ecology dissertations, 1938-2021
https://doi.org/10.5061/dryad.h44j0zprx
Dissertations are a foundational scientific product; they are the formative product through which early-career scientists create and share original knowledge. Methodological approaches used in dissertations vary with the research field. In plant ecology, these approaches include observations, experiments (field or controlled-environment), literature reviews, theoretical approaches, or analyses of existing data (including ‘big data’). This dataset was created to examine how the emphasis on each of these categories has changed over time, and whether male and female authors differ in the methods employed. The dissertations used for this study were gathered from the Proquest Dissertations and Theses Global (https://www.proquest.com/pqdtglobal) database.
Description of the data and file structure
Data files:
1.ProquestDownload.csv - Original search results and metadata from each search result (dissertation) downloaded from Proquest Dissertations & Theses Global. This is the raw file as downloaded from Proquest without any changes made by the author. Column names are explained below:
Column name | Explanation |
---|---|
Title | Thesis title |
Abstract | Thesis abstract |
StoreId | Unique ID for the thesis in the Proquest database |
AlternateTitle | Alternate title, if any |
ArticleType | Article type, i.e. thesis, journal article, etc. |
Authors | Name of author (Last name,first name) |
companies | [meaning unknown; not used in analysis] |
copyright | Copyright information |
digitalObjectIdentifier | Digital object identifier (DOI), if any |
documentType | Type of document, i.e. thesis, journal article etc. (similar to ArticleType) |
entryDate | Year of publication |
isbn | International Standard Book Number (ISBN) for the thesis (if any) |
language | Language in which the thesis was written |
languageOfSummary | Language in which the abstract/summary was written |
originalTitle | Original thesis title (usually same as Title column) |
pubdate | Year of publication (same as entryDate column) |
pubtitle | Title of publication (usually same as Title and originalTitle columns) |
year | Year of publication (same as pubdate column) |
DocumentURL | URL link to the Proquest page for the given thesis |
classification | Thesis subject, as categorized by Proquest (e.g. ecology, botany, plant biology, etc.) |
classificationCodes | Same as the classification column |
majorClassificationCodes | Same as the classification column |
notes | Miscellaneous information |
subjectClassifications | Same as the classification column |
subjectTerms | Keywords related to the topic/topics covered by the thesis |
subjects | Additional keywords related to the thesis topic (some overlap with subjectTerms) |
URL | URL to any other webpage where the thesis/thesis information is available |
FindACopy | URL for finding a copy of the full text through Stony Brook University Library (which provided access to Proquest) |
Database | Database from which the document information was obtained (among the various Proquest databases) |
2.UsableDissertations.csv - Search results that were usable (i.e., abstracts available and in English). This was determined by manually reading the entries in the ProquestDownload.csv file. Column names same as in 1.ProquestDownload.csv.
3.RelevantDissertations.csv - Usable dissertations (search results) classified into relevant or non-relevant for this study (see methods for details). Column names: Relevant? = whether the dissertation is relevant ('y') or not ('n'); all other column names same as in 1.ProquestDownload.csv.
4.Classified.csv - Dissertations classified into the methodological categories (see methods for details). Column names: Title - thesis title, Abstract - thesis abstract, Authors - author name, StoreId - unique ID number for thesis in the Proquest database, obs - Observational, exp-c - Controlled-environment experiments, exp-f - Field experiments, lit - Literature-based, dat - Database study, mat - Mathematical modelling & simulations, sum - total number of categories in a dissertation, used full text - whether the full text was used for classification (y=yes, n=no). In the methodological category columns (obs, exp-c, exp-f, lit, dat, mat), 1 indicates that the dissertation used that approach and 0 indicates that it did not.
5.MethodsTabulations.csv - Number and percentage of dissertations in each methodological category, across time. These tabulations were carried out in Excel. Column names are self-explanatory or have the same meaning as in Classified.csv.
6.Gender.csv - Gender of dissertation authors, for all relevant dissertations, as determined by genderize.io and through online profiles. Column names are explained below:
Column name | Explanation |
---|---|
Title | Thesis title |
Year | Year of publication |
StoreId | Unique ID number for each thesis |
Last name | Last name of author |
First name | First name of author |
middle 1 | Middle name of author, if any. If author has >1 middle names, they were split into middle 1, middle 2, etc. |
middle 2 | See middle 1 |
middle 3 | See middle 1 |
genderizeGender | Gender of the author, as determined by genderize.io |
genderProbability | gender probability, i.e., probability of the author name belonging to the assigned gender, as given by genderize.io |
genderCount | name count, i.e., number of records of the corresponding name in the genderize.io database |
genderFinal | final gender assignment of the author, after considering the gender probability and gender count cut-offs (see methods) and after looking at middle names and online profiles, if necessary (unknown = gender could not be determined either by genderize.io or by online profiles, unkown2 = gender was determined by genderize.io, but either the gender probability or the gender count fell below the cut-off, and no further information was found from middle names or online profiles) |
notes | Notes on whether online profiles and/or middle names were used for gender determination |
7.GenderByCategory.csv - Gender of dissertation authors for the dissertations classified by methodology. Generated using the code in gender.R. Column names same as Gender.csv and Classified.csv.
8.GenderTabulations.csv - Gender ratio over time, calculated from the Gender.csv data. Calculations carried out in Excel. Column names: Time Period - time period, Num_Male - number of male authors, Num_Female - number of female authors, Num_Unknown - Number of authors whose gender is unknown, Total_Genderknown - total number of authors whose gender could be determined for the given time period, Total - total number of dissertations from the given time period, %male - percentage male ( = Num_Male/Total_Genderknown), %female - percentage female ( = Num_Female/Total_Genderknown), male/female - male:female ratio ( = Num_Male/Num_Female)
9.GenderCategoryTabulation.csv - Gender ratio across methodological categories, calculated from GenderByCategory.csv. Calculations carried out in Excel. Column names: Category - methodology category, Num_Male - number of male authors, Num_Female - number of female authors, Num_Unknown - number of authors whose gender is unknown, Total_Genderknown - total number of authors whose gender could be determined for the given category, Total - total number of dissertations in the given category, %male - percentage male ( = Num_Male/Total_Genderknown), %female - percentage female ( = Num_Female/Total_Genderknown), male/female - male:female ratio ( = Num_Male/Num_Female)
ProquestDownload2.txt - Another download of search results from the Proquest database. This is not a tabular data file as Proquest provides location information (country of dissertation author) only when search results are downloaded in this format.
10.StudyLocations.csv - Tabulation of number of dissertations from each country. Generated using the code in StudyLocations.R.
11.Study_Locations_forClassified.csv - Tabulation of number of dissertations from each country, among the dissertations that were classified by methodology. Generated using the code in StudyLocations.R.
Code files:
gender.R - code for joining gender information (from Gender.csv) to methodological information (from Classified.csv). This was done so that the gender ratio in each methodological category could be calculated.
Graphs.R - code for generating graphs on temporal trends in methodological categories and gender ratio in plant ecology. These graphs are shown in the publication associated with this dataset.
StudyLocations.R - code for extracting location information (country of each dissertation author) from ProquestDownload2.txt, in order to calculate the number of dissertations from each country.
Sharing/Access information
Code/Software
Most of the analysis was carried out in Excel. Graphing and some data wrangling were carried out in R; the code is provided (see description of the data and file structure).
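For readers who want to see the shape of the join and tabulation without opening gender.R or the Excel sheets, here is a minimal Python sketch. It is not the authors' R code. It assumes that StoreId is the join key, that genderFinal holds the values "male", "female" and "unknown", and that %male is reported on a 0-100 scale; the rows are made-up toy data.

```python
from collections import defaultdict

# Toy rows standing in for Classified.csv and Gender.csv (values are invented).
classified = [
    {"StoreId": "101", "obs": 1, "exp-f": 0},
    {"StoreId": "102", "obs": 1, "exp-f": 1},
    {"StoreId": "103", "obs": 0, "exp-f": 1},
]
gender = [
    {"StoreId": "101", "genderFinal": "female"},
    {"StoreId": "102", "genderFinal": "male"},
    {"StoreId": "103", "genderFinal": "unknown"},
    {"StoreId": "104", "genderFinal": "male"},  # not classified, dropped by the join
]


def join_by_store_id(classified_rows, gender_rows):
    """Inner-join the two tables on StoreId (assumed to be the join key)."""
    gender_by_id = {row["StoreId"]: row["genderFinal"] for row in gender_rows}
    return [
        {**row, "genderFinal": gender_by_id[row["StoreId"]]}
        for row in classified_rows
        if row["StoreId"] in gender_by_id
    ]


def tabulate(joined_rows, categories):
    """Count authors by gender within each category and derive %male."""
    table = {}
    for cat in categories:
        counts = defaultdict(int)
        for row in joined_rows:
            if row[cat] == 1:
                counts[row["genderFinal"]] += 1
        known = counts["male"] + counts["female"]
        table[cat] = {
            "Num_Male": counts["male"],
            "Num_Female": counts["female"],
            "Total_Genderknown": known,
            # Percentage of authors with known gender who are male.
            "%male": 100 * counts["male"] / known if known else None,
        }
    return table


joined = join_by_store_id(classified, gender)
summary = tabulate(joined, ["obs", "exp-f"])
```

A dissertation that uses several approaches is counted once in each of its categories, which is why the per-category totals can exceed the number of dissertations.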
Methods
The Proquest Dissertations & Theses Global Database (https://www.proquest.com/pqdtglobal) was used to find relevant dissertations. The website was accessed on January 4, 2022, and the following search string was used:
((SU(plant? OR vegetation OR tree OR leaf OR botan* OR flora* OR seedling* OR grass*) OR TI(plant? OR vegetation OR tree OR leaf OR botan* OR flora* OR seedling* OR grass*)) AND (SU(ecology OR ecolog* OR ecosystem? OR communit* OR conservation OR diversity OR biodiversity OR range OR trait?) OR TI(ecology OR ecolog* OR ecosystem? OR communit* OR conservation OR diversity OR biodiversity OR range OR trait?)) AND LA(en OR eng OR english)) NOT ("water reclamation" plant? OR "water treatment" plant? OR econom* OR bioengineer* OR biotechnolog* OR "bacterial flora")
The search results were further filtered as follows:
Manuscript type: Doctoral dissertations; Language: English
Subject: NOT (plant pathology AND genetics AND agronomy AND microbiology AND soil sciences AND zoology AND plant propagation AND molecular biology AND business community AND livestock AND wildlife management AND animal behavior AND physical geography AND cartography AND fish production AND behavioral sciences AND civil engineering AND enzymes)
This search returned 5423 results. These were then manually screened for metadata completeness (e.g., author name, year of publication) and availability of abstract. Dissertations with incomplete information, particularly missing abstract or year of publication, or those which were not in English, were removed. This left us with 3832 studies. These were then screened for relevance, based on whether the study focused on a topic in plant ecology. This was done by reading titles and abstracts. Relevant studies were defined as those that focused on one or more plant species or communities, and that studied the interactions of those plant species/communities with other organisms or with the environment. Only studies on embryophytes (i.e., bryophytes and vascular plants) were included. Other taxonomic groups that have been traditionally included under the term “plants”, such as algae and fungi, were excluded. After screening for relevance, 2670 dissertations remained.
Methodological Classification
We initially selected 20% of the relevant dissertations at random for classification by category. After this initial classification, an additional 5% were added; as the proportions among categories were stable at that point, we felt that this was a sufficient sample of the approaches taken for the total population of studies for each decade. In total, 670 samples were classified.
Classification was carried out primarily by reading the abstracts. However, if the abstract did not contain sufficient information for unambiguous classification, the full text was used, if available. If the abstract was insufficient and the full text was unavailable, the dissertation was removed and replaced by another randomly chosen dissertation from the same decade.
The classified dissertations were divided into 10-year time periods based on year of publication, for quantifying temporal trends. Note that studies from 2020-2021 were placed in the 2010s time-period. Studies from before 1980 were grouped into a single period, due to the small number of such studies.
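The binning rule above can be written as a small function. This is an illustrative Python sketch only; the function name and the period labels ("pre-1980", "1980s", and so on) are hypothetical and may not match the labels used in the tabulation files.

```python
def time_period(year):
    """Map a publication year to the time period described in the text."""
    if year < 1980:
        # All studies before 1980 are pooled into a single period.
        return "pre-1980"
    if 2020 <= year <= 2021:
        # Studies from 2020-2021 are placed in the 2010s period.
        return "2010s"
    decade = (year // 10) * 10
    return f"{decade}s"


periods = [time_period(y) for y in (1938, 1980, 1999, 2015, 2021)]
```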
Gender Classification
The gender of each thesis author was determined using genderize.io (https://genderize.io/), an application programming interface (API) that uses social media records to predict a person’s gender from their given name. Along with a gender classification, the program also provides the probability of the name belonging to that gender, and the number of past records of that name in the API’s database. We set a probability cut-off of 0.8 and a past record cut-off of 50, to reduce the chances of false classifications (names that fell below either of these cut-offs were considered undetermined). For authors whose gender could not be determined from their first names, we re-ran the algorithm on their middle names, if available. If middle names were not available, or if gender could not be determined from the middle name, we looked up online profiles (e.g., institutional profile, lab website, etc.) of the author, and attempted to determine gender from their photographs. Any remaining undetermined records were removed from the gender analysis. We were able to determine the gender of 2392 authors, out of a total of 2760 (ca. 87% of the dataset).
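The cut-off logic can be sketched as follows. This Python example makes no API calls: each prediction is passed in as a hypothetical (gender, probability, count) tuple, first name first and middle names after it. It reads "fell below" as strictly less than, so a name exactly at a cut-off is accepted; that boundary choice is an assumption. The manual look-up of online profiles is not modelled.

```python
PROBABILITY_CUTOFF = 0.8  # minimum probability from genderize.io
COUNT_CUTOFF = 50         # minimum number of past records of the name


def accept_prediction(gender, probability, count):
    """Return the predicted gender if it passes both cut-offs, else None."""
    if gender is None:
        return None
    if probability < PROBABILITY_CUTOFF or count < COUNT_CUTOFF:
        return None
    return gender


def resolve_gender(name_predictions):
    """Try the first name, then each middle name, in order.

    name_predictions: list of (gender, probability, count) tuples.
    Returns the first accepted gender, or "undetermined" if none passes.
    """
    for gender, probability, count in name_predictions:
        accepted = accept_prediction(gender, probability, count)
        if accepted is not None:
            return accepted
    return "undetermined"


# First name falls below the probability cut-off; the middle name passes.
example = resolve_gender([("female", 0.79, 5000), ("female", 0.90, 60)])
```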
Chi-squared goodness-of-fit tests were used to determine whether the gender ratio in each decade differed significantly from a 1:1 ratio (equal male and female representation). We also considered whether different methodologies were more or less likely to be used by either gender, using the dissertations that had been classified by methodology. This was done by comparing the gender ratio of authors in each methodology category to the overall gender ratio, using Chi-squared goodness-of-fit tests.
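As an illustration of the test described above, the following Python sketch computes a two-category Chi-squared goodness-of-fit statistic and its p-value by hand. It is not the authors' procedure, and the software they used for the tests is not stated here. The function name and the `expected_prop_male` parameter are hypothetical: 0.5 gives the 1:1 comparison, and passing the overall proportion of male authors gives the per-category comparison. With two categories the test has one degree of freedom, for which the p-value equals erfc(sqrt(statistic / 2)).

```python
import math


def chi_squared_gof(num_male, num_female, expected_prop_male=0.5):
    """Two-category Chi-squared goodness-of-fit test (1 degree of freedom).

    Returns (statistic, p_value). expected_prop_male must be strictly
    between 0 and 1, and the total count must be positive.
    """
    total = num_male + num_female
    expected_male = total * expected_prop_male
    expected_female = total * (1 - expected_prop_male)
    statistic = (
        (num_male - expected_male) ** 2 / expected_male
        + (num_female - expected_female) ** 2 / expected_female
    )
    # Survival function of the Chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: 60 male and 40 female authors tested against 1:1.
statistic, p_value = chi_squared_gof(60, 40)
```

For these made-up counts the statistic is 4.0 and the p-value is about 0.046.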