Repository Analytics and Metrics Portal (RAMP) 2017 data

Citation

Wheeler, Jonathan; Arlitsch, Kenning (2021), Repository Analytics and Metrics Portal (RAMP) 2017 data, Dryad, Dataset, https://doi.org/10.5061/dryad.r7sqv9scf

Abstract

The Repository Analytics and Metrics Portal (RAMP, http://rampanalytics.org) is a web service that aggregates use and performance data of institutional repositories (IR). This dataset is a subset of RAMP data, consisting of data from all participating repositories for the calendar year 2017. For a description of the data collection, processing, and output methods, please see the "Methods" section below.

Methods

RAMP Data Documentation – January 1, 2017 through August 18, 2018

Data Collection

RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:

  • url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
  • impressions: The number of times the URL appears within the SERP.
  • clicks: The number of clicks on a URL which took users to a page outside of the SERP.
  • clickThrough: Calculated as the number of clicks divided by the number of impressions.
  • position: The position of the URL within the SERP.
  • country: The country from which the corresponding search originated.
  • device: The device used for the search.
  • date: The date of the search.
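
The mapping from a GSC response to the fields above can be sketched as follows. This is a minimal illustration, not RAMP's actual harvesting code: the function names and the order of the requested dimensions are assumptions made for the example. The request body keys (startDate, endDate, dimensions) and the response row keys (keys, clicks, impressions, ctr, position) follow the Search Console API documentation for searchanalytics.query.

```python
from typing import Any

# Assumed dimension order for this sketch. GSC returns each row's "keys"
# in the same order as the dimensions that were requested.
DIMENSIONS = ["page", "country", "device", "date"]


def build_query_body(start_date: str, end_date: str) -> dict[str, Any]:
    """Build a searchanalytics.query request body for one date range."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": DIMENSIONS,
    }


def flatten_row(row: dict[str, Any]) -> dict[str, Any]:
    """Map one GSC response row onto the field names used in this dataset."""
    keys = dict(zip(DIMENSIONS, row["keys"]))
    return {
        "url": keys["page"],
        "impressions": row["impressions"],
        "clicks": row["clicks"],
        "clickThrough": row["ctr"],  # GSC reports clicks / impressions as "ctr"
        "position": row["position"],
        "country": keys["country"],
        "device": keys["device"],
        "date": keys["date"],
    }
```

Sending the request itself requires an authorized API client and is omitted here.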

Following the data processing described below, an additional field, citableContent, is added to the page level data on ingest into RAMP.

Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

Data Processing

Upon download from GSC, data are processed to identify URLs that point to citable content. As part of the daily download of statistics from GSC, URLs are analyzed to determine whether they point to HTML pages or to actual content files. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.), and URLs that point to such files are flagged as "citable content." Following this analysis, one more field, citableContent, is added to the fields downloaded from GSC described above. It records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
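
The exact rule RAMP uses to distinguish content files from HTML pages is not given here. As a rough sketch only, a check based on the file extension in the URL path might look like the following. The extension list and the function name are hypothetical, and a real repository may serve content files from URLs that carry no extension at all.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical allowlist of content-file extensions, for illustration only.
CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xls", ".xlsx", ".zip"}


def citable_content(url: str) -> str:
    """Return "Yes" if the URL path ends in a known content-file extension."""
    path = urlparse(url).path  # drops the query string and fragment
    suffix = PurePosixPath(path).suffix.lower()
    return "Yes" if suffix in CONTENT_EXTENSIONS else "No"
```

An allowlist is used rather than "anything that is not .html" so that item pages whose identifiers contain a dot are not flagged by mistake.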

Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.

About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

  1. Filter data to only include rows where "citableContent" is set to "Yes."
  2. Sum the value of the "clicks" field on these rows.
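
The two steps above can be sketched against rows read from the published CSV files. This is an illustrative sketch rather than the RAMP application's code: the function name is hypothetical, and it assumes that the date field is in ISO format (YYYY-MM-DD) and that CSV values arrive as strings.

```python
from datetime import date
from typing import Iterable, Mapping


def citable_content_downloads(
    rows: Iterable[Mapping[str, str]], start: date, end: date
) -> int:
    """Sum clicks on citable-content rows within an inclusive date range."""
    total = 0
    for row in rows:
        row_date = date.fromisoformat(row["date"])  # assumes YYYY-MM-DD
        if not (start <= row_date <= end):
            continue
        # Step 1: keep only rows flagged as citable content.
        if row["citableContent"] != "Yes":
            continue
        # Step 2: add the row's clicks (tolerates values such as "3.0").
        total += int(float(row["clicks"]))
    return total
```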

Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.

The data in these CSV files include the following fields:

  • url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
  • impressions: The number of times the URL appears within the SERP.
  • clicks: The number of clicks on a URL which took users to a page outside of the SERP.
  • clickThrough: Calculated as the number of clicks divided by the number of impressions.
  • position: The position of the URL within the SERP.
  • country: The country from which the corresponding search originated.
  • device: The device used for the search.
  • date: The date of the search.
  • citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than an HTML wrapper page. Possible values are Yes or No.
  • index: The Elasticsearch index corresponding to page click data for a single IR.
  • repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
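
To illustrate why aggregation should use repository_id rather than index, the sketch below sums clicks grouped by an arbitrary field. The function name and the sample identifiers in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def sum_clicks_by(rows: Iterable[Mapping[str, str]], field: str) -> dict[str, int]:
    """Sum the clicks field for each distinct value of the given field."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[field]] += int(float(row["clicks"]))
    return dict(totals)
```

If one repository appears under two index names, for example after a platform migration, grouping by index splits its clicks across two totals, while grouping by repository_id keeps them together.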

Filenames for files containing these data follow the format YYYY-MM_RAMP_all.csv. For example, the file 2017-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January 2017.
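
When iterating over the monthly files, the naming pattern can be generated or parsed programmatically. The helper names below are hypothetical. The pattern itself is inferred from the single example filename given above.

```python
import re

# Pattern inferred from the example filename 2017-01_RAMP_all.csv.
FILENAME_PATTERN = re.compile(r"^(\d{4})-(\d{2})_RAMP_all\.csv$")


def ramp_filename(year: int, month: int) -> str:
    """Build the CSV filename for one month of RAMP data."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return f"{year:04d}-{month:02d}_RAMP_all.csv"


def parse_ramp_filename(name: str) -> tuple[int, int]:
    """Extract (year, month) from a RAMP CSV filename."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a RAMP monthly filename: {name!r}")
    return int(match.group(1)), int(match.group(2))
```

For instance, `[ramp_filename(2017, m) for m in range(1, 13)]` lists the twelve expected CSV names for this dataset.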

References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original

Usage Notes

Data have been compressed into zip files to reduce the filesize. Each zip file contains one CSV file.
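
Because each zip file holds a single CSV, the rows can be read without extracting the archive to disk. The following is a minimal sketch using the Python standard library. The function name is hypothetical, and UTF-8 encoding is an assumption that should be checked against the actual files.

```python
import csv
import io
import zipfile
from typing import Iterator


def read_zipped_csv(source) -> Iterator[dict[str, str]]:
    """Yield rows as dicts from the single CSV inside a zip archive.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError(f"expected exactly one CSV, found {len(csv_names)}")
        with archive.open(csv_names[0]) as raw:
            # Encoding is assumed; newline="" lets the csv module handle line ends.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
```

The function is a generator, so rows are streamed one at a time, which helps with larger monthly files.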

The RAMP team values and appreciates the collaboration and contributions of RAMP participating IR. For a complete list of participating IR, please see the RAMP website at https://rampanalytics.org/ramp-repositories/. We also acknowledge Google, whose APIs are used to collect data.

RAMP data are audited for completeness and accuracy prior to publication. However, access to GSC data for any single repository may be impacted by repository migrations, configuration changes, rate limits, or other causes. Such issues may be temporary or occur over longer time periods, and the completeness and accuracy of the published data for individual repositories can therefore be variable. When such issues are known, every reasonable attempt is made to correct or retrieve missing or dropped data.

The variability that may exist within the dataset with regard to its completeness for individual repositories should be kept in mind when comparing data from two or more repositories. In particular, RAMP administrators strongly discourage qualitative assessments of any and all repositories, institutions, providers, and platforms represented within the dataset.

An additional file is included with the dataset, "RAMP_repository_info.csv". This file includes information about the participating repositories whose data are included in the dataset. It provides the platform and the country of each repository, which are not included in the RAMP data, and may be useful for filtering or subsetting the data by platform or country. The fields in this CSV file are:

  • repository_id: This is the same as the repository_id field described in the RAMP data documentation above. This shared ID can be used to merge RAMP data with the platform and country information available in this file.
  • country: The two letter ISO code for the country in which the repository is located.
  • ir_platform: The software platform used by the repository. Note that some RAMP repositories have migrated platforms over time and may as a result be represented multiple times in this file. All of the major IR platforms are represented, as well as custom built applications.
  • name: The name of the repository. Typically, this includes the name of the host organization and the name of the repository.
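
A join on repository_id can be sketched as below, here used to subset RAMP rows by platform. The function names are hypothetical, and the platform strings in any call are illustrative values, not a list of what the file contains. Because a repository that migrated platforms may appear more than once in this file, the lookup keeps a list of info rows per repository_id instead of assuming one row each.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def index_repository_info(
    info_rows: Iterable[Mapping[str, str]],
) -> dict[str, list[Mapping[str, str]]]:
    """Group rows of RAMP_repository_info.csv by repository_id."""
    by_id: dict[str, list[Mapping[str, str]]] = defaultdict(list)
    for info in info_rows:
        by_id[info["repository_id"]].append(info)
    return dict(by_id)


def filter_by_platform(
    ramp_rows: Iterable[Mapping[str, str]],
    info_by_id: Mapping[str, list[Mapping[str, str]]],
    platform: str,
) -> list[Mapping[str, str]]:
    """Keep RAMP rows whose repository has ever been listed on the platform."""
    wanted = {
        repo_id
        for repo_id, infos in info_by_id.items()
        if any(info["ir_platform"] == platform for info in infos)
    }
    return [row for row in ramp_rows if row["repository_id"] in wanted]
```

Note that both files contain a field named country with different meanings (origin of the search in the RAMP data, location of the repository in this file), so one of them would need renaming if the two tables were merged into a single set of columns.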

Please note that the "RAMP_repository_info.csv" file is generated from current RAMP configuration data, and may include information about repositories whose data are not in the dataset if they joined RAMP after 2017.

Funding

Institute of Museum and Library Services, Award: LG-72-18-0179

Institute of Museum and Library Services, Award: LG-06-14-0090