Dryad

RAMP data subset, January 1 through May 31, 2019

Cite this dataset

Wheeler, Jonathan et al. (2020). RAMP data subset, January 1 through May 31, 2019 [Dataset]. Dryad. https://doi.org/10.5061/dryad.fbg79cnr0

Abstract

The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://ramp.montana.edu/), consisting of data from 35 (out of 50) participating institutional repositories (IR) from the period of January 1 through May 31, 2019. This subset represents data analyzed for a pending publication. For a description of the data collection, processing, and output methods, please see the "methods" section below.

The 'RAMP Primer,' a Jupyter Notebook consisting of Python code for combining monthly data and generating some aggregate statistics, is available from https://github.com/imls-measuring-up/ramp-documentation.git. The linked repository also includes data documentation similar to that provided in the "methods" section below, as well as a file of IR index names useful for subsetting and filtering the data.

Update: Version two of this dataset was uploaded on January 14, 2020. Thanks to RAMP participants, the RAMP administrators discovered an error in the daily data harvest that resulted in incomplete page-click data for roughly 15 RAMP participating repositories. Only page-click data as described below were affected, and the corresponding CSV files have been replaced with corrected data. The country-device data were not affected and have not been changed from version 1.

Methods

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

  • url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
  • impressions: The number of times the URL appears within the SERP.
  • clicks: The number of clicks on a URL which took users to a page outside of the SERP.
  • clickThrough: Calculated as the number of clicks divided by the number of impressions.
  • position: The position of the URL within the SERP.
  • date: The date of the search.
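As a brief illustration of the clickThrough definition above, the calculation can be sketched as follows. The function name and the sample values are hypothetical, and returning 0.0 when there are no impressions is an assumption made only to keep the sketch safe to run; the source does not say how such rows are handled.

```python
def click_through(clicks: int, impressions: int) -> float:
    """Return clicks divided by impressions.

    Assumption: a row with zero impressions yields 0.0 rather than an error.
    """
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Made-up values for illustration only: 5 clicks over 200 impressions.
print(click_through(5, 200))  # 0.025
```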

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:

  • country: The country from which the corresponding search originated.
  • device: The device used for the search.
  • impressions: The number of times the URL appears within the SERP.
  • clicks: The number of clicks on a URL which took users to a page outside of the SERP.
  • clickThrough: Calculated as the number of clicks divided by the number of impressions.
  • position: The position of the URL within the SERP.
  • date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from GSC, URLs are analyzed to determine whether they point to HTML pages or actual content files, and URLs that point to content files are flagged as "citable content." Following this brief analysis, one more field, citableContent, is added to the page level data in addition to the fields downloaded from GSC described above. It records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
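The flagging step can be approximated with a check on the file extension in the URL path. This is a minimal sketch under stated assumptions, not RAMP's actual detection logic, which is not described here: the function name, the extension list, and the example URLs are all hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension list; the documentation only says "PDF, CSV, etc."
CONTENT_EXTENSIONS = {".pdf", ".csv"}


def citable_content(url: str) -> str:
    """Return "Yes" if the URL path ends in a known content-file extension."""
    path = urlparse(url).path  # drops any query string such as ?sequence=1
    extension = posixpath.splitext(path)[1].lower()
    return "Yes" if extension in CONTENT_EXTENSIONS else "No"


# Example URLs are made up for illustration.
print(citable_content("https://ir.example.edu/bitstream/1/2/thesis.pdf?sequence=1"))
print(citable_content("https://ir.example.edu/handle/1/2"))
```

With the assumed extension list, the first URL is flagged "Yes" and the second "No".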

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data; the second index includes the country of origin and device type data.

About Citable Content Downloads

Data visualizations and aggregations in the RAMP interface accessible from http://ramp.montana.edu/ present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

  1. Filter data to only include rows where "citableContent" is set to "Yes."
  2. Sum the value of the "clicks" field on these rows.
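The two steps above can be sketched in Python using only the standard library. The sample rows are made up, and the sketch assumes that dates are plain YYYY-MM-DD strings and that click counts parse as numbers; the published CSV files may format these fields differently.

```python
import csv
import io

# Made-up sample rows using the page-click field names described above.
SAMPLE = """url,clicks,impressions,citableContent,date
https://ir.example.edu/handle/1/2,3,40,No,2019-01-01
https://ir.example.edu/bitstream/1/2/a.pdf,4,50,Yes,2019-01-01
https://ir.example.edu/bitstream/1/3/b.csv,1,10,Yes,2019-01-02
"""


def citable_content_downloads(rows, start, end):
    """Sum clicks on rows flagged as citable content within [start, end]."""
    total = 0
    for row in rows:
        in_range = start <= row["date"] <= end  # assumes YYYY-MM-DD strings
        if in_range and row["citableContent"] == "Yes":
            total += int(float(row["clicks"]))
    return total


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(citable_content_downloads(rows, "2019-01-01", "2019-01-31"))  # 5
```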

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URL. The second dataset is aggregated by the country from which a search was conducted and the device used.

As a result, two CSV datasets are provided for each month of published data:

page-clicks:

The data in these CSV files correspond to the page-level data, and include the following fields:

  • url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
  • impressions: The number of times the URL appears within the SERP.
  • clicks: The number of clicks on a URL which took users to a page outside of the SERP.
  • clickThrough: Calculated as the number of clicks divided by the number of impressions.
  • position: The position of the URL within the SERP.
  • date: The date of the search.
  • citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
  • index: The Elasticsearch index corresponding to page click data for a single IR. Since the monthly CSV files include data for all participating IR (or all IR included in a subset), index names are needed to extract data for individual IR, or groups of IR.

Filenames for files containing these data end with “page-clicks”. For example, the file named 2019-01_RAMP_subset_page-clicks.csv contains page level click data for a subset of 35 RAMP participating IR for the month of January, 2019.
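Because each monthly file holds rows for many IR, a common first step is to combine months and keep only the rows for selected indices. The following is a minimal sketch of that step; the index names and sample rows are hypothetical, and in-memory strings stand in for the monthly CSV files.

```python
import csv
import io
from itertools import chain

# Made-up stand-ins for two monthly page-clicks files; index names are hypothetical.
JAN = """url,clicks,index
https://a.example.edu/1.pdf,2,ir_a_page_clicks
https://b.example.edu/1.pdf,5,ir_b_page_clicks
"""
FEB = """url,clicks,index
https://a.example.edu/1.pdf,3,ir_a_page_clicks
"""


def read_rows(handles):
    """Yield dict rows from several open CSV files, one file after another."""
    return chain.from_iterable(csv.DictReader(handle) for handle in handles)


def filter_by_index(rows, index_names):
    """Keep only rows whose index field names one of the requested IR."""
    wanted = set(index_names)
    return [row for row in rows if row["index"] in wanted]


combined = read_rows([io.StringIO(JAN), io.StringIO(FEB)])
ir_a = filter_by_index(combined, ["ir_a_page_clicks"])
print(len(ir_a))  # 2
```

With the published files, each `io.StringIO(...)` would be replaced by an open file handle, for example `open("2019-01_RAMP_subset_page-clicks.csv", newline="")`.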

country-device-information:

The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

  • country: The country from which the corresponding search originated.
  • device: The device used for the search.
  • impressions: The number of times the URL appears within the SERP.
  • clicks: The number of clicks on a URL which took users to a page outside of the SERP.
  • clickThrough: Calculated as the number of clicks divided by the number of impressions.
  • position: The position of the URL within the SERP.
  • date: The date of the search.
  • index: The Elasticsearch index corresponding to country and device access information data for a single IR. Since the monthly CSV files include data for all participating IR (or all IR included in a subset), index names are needed to extract data for individual IR, or groups of IR.

Filenames for files containing these data end with “country-device-info”. For example, the file named 2019-01_RAMP_subset_country-device-info.csv contains country and device data for a subset of 35 RAMP participating IR for the month of January, 2019.
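As one example of working with these files, clicks can be totaled for each country and device combination. This is a sketch under stated assumptions: the sample rows, country codes, device labels, and index names are made up for illustration and may not match the values in the published data.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows using the country-device field names described above.
SAMPLE = """country,device,impressions,clicks,date,index
usa,DESKTOP,100,10,2019-01-01,ir_a_country_device
usa,DESKTOP,50,5,2019-01-02,ir_a_country_device
usa,MOBILE,40,2,2019-01-01,ir_a_country_device
can,DESKTOP,20,1,2019-01-01,ir_b_country_device
"""


def clicks_by_country_device(rows):
    """Return total clicks keyed by (country, device)."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["country"], row["device"])] += int(float(row["clicks"]))
    return dict(totals)


totals = clicks_by_country_device(csv.DictReader(io.StringIO(SAMPLE)))
print(totals[("usa", "DESKTOP")])  # 15
```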

References

Google, Inc. (2019). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original

Usage notes

The RAMP team values and appreciates the collaboration and contributions of RAMP participating IR. We also acknowledge Google, whose APIs are used to collect data. If these data are used for additional analysis, please cite as follows:

Wheeler, Jonathan, Kenning Arlitsch, Minh Pham, Nikolaus Parulian, Patrick OBrien, Jeff Mixter, Montana State University ScholarWorks, University of New Mexico Digital Repository, McMaster University MacSphere, Maryland Shared Open Access Repository (MD SOAR), Digital Repository at the University of Maryland (MD DRUM), Mountain Scholar Digital Collections of Colorado and Wyoming, University of Michigan Deep Blue, Rutgers University RUcore Institutional Repository, Kansas State University Research Exchange (K-REx), Swarthmore College Works, Bryn Mawr, Haverford, and Swarthmore Colleges: TriCollege Libraries Institutional Repository, University of Oklahoma, Oklahoma State University, and the University of Central Oklahoma: SHAREOK Repository, University of Nevada Digital Scholarship@UNLV, University of Kentucky UKnowledge, Swedish University of Agricultural Sciences Epsilon Open Archive, Swedish University of Agricultural Sciences Epsilon Archive for Student Projects, Northern Kentucky University Digital Repository, Massey University Massey Research Online, University of Waterloo UWSpace, Caltech Library CaltechAUTHORS Repository, Caltech Library CaltechTHESIS Repository, University of Texas at Austin Texas ScholarWorks, Northeastern University Digital Repository Service, University of Pittsburgh D-Scholarship@Pitt, University of the Western Cape Electronic Theses and Dissertations Repository, University of the Western Cape Research Repository, Royal Roads University and Vancouver Island University VIURRSpace, University of Strathclyde Strathprints, University of Montana ScholarWorks, Virginia Tech VTechWorks, Sam Houston State University Scholarly Works @ SHSU, Indiana University - Purdue University Indianapolis (IUPUI) ScholarWorks, University of Plymouth PEARL, Digital Commons @ The University of Nebraska Lincoln, University of Wollongong Australia: Research Online (2019). RAMP data subset, January 1 through May 31, 2019, University of New Mexico, Dataset, https://doi.org/10.5061/dryad.fbg79cnr0

Funding

Institute of Museum and Library Services, Award: LG-06-14-0090

Institute of Museum and Library Services, Award: LG-72-18-0179