# RAMP Data

## Data Collection

RAMP data are downloaded for participating institutional repositories (IR) from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

• url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
• impressions: The number of times the URL appears within the SERP.
• clicks: The number of clicks on a URL which took users to a page outside of the SERP.
• clickThrough: Calculated as the number of clicks divided by the number of impressions.
• position: The position of the URL within the SERP.
• date: The date of the search.

Following the data processing described below, an additional field, citableContent, is added to the page level data on ingest into RAMP.

The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:

• country: The country from which the corresponding search originated.
• device: The device used for the search.
• impressions: The number of times the URL appears within the SERP.
• clicks: The number of clicks on a URL which took users to a page outside of the SERP.
• clickThrough: Calculated as the number of clicks divided by the number of impressions.
• position: The position of the URL within the SERP.
• date: The date of the search.

Note that no personally identifiable information is downloaded by RAMP.
Google does not make such information available.

More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

## Data Processing

Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

## About Citable Content Downloads

Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use. CCD information is summary data calculated on the fly within the RAMP web application.
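Because CCD depend on the citableContent flag added during data processing, a minimal sketch of such a flag may help. The documentation does not specify how RAMP analyzes URLs, so this example assumes a simple file-extension check. The function name `flag_citable_content`, the extension list, and the sample URLs are hypothetical and are not RAMP's actual implementation.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of non-HTML file extensions (an assumption for illustration;
# the actual rules RAMP applies are not described in this document).
CONTENT_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xlsx", ".zip"}


def flag_citable_content(url: str) -> str:
    """Return "Yes" if the URL path ends in a known content-file extension, else "No"."""
    path = urlparse(url).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return "Yes" if extension in CONTENT_EXTENSIONS else "No"


# Made-up page level rows, using the field names described in this document.
rows = [
    {"url": "https://ir.example.edu/handle/1/123", "clicks": 4},
    {"url": "https://ir.example.edu/bitstream/1/123/thesis.pdf", "clicks": 7},
]
for row in rows:
    row["citableContent"] = flag_citable_content(row["url"])
```

Under these assumptions the first row (an HTML item page) is flagged "No" and the second (a PDF file) is flagged "Yes".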
As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

For any specified date range, the steps to calculate CCD are:

1. Filter data to only include rows where "citableContent" is set to "Yes."
2. Sum the value of the "clicks" field on these rows.

## Output to CSV

Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.

As a result, two CSV datasets are provided for each month of published data:

### page-clicks:

The data in these CSV files correspond to the page-level data, and include the following fields:

• url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
• impressions: The number of times the URL appears within the SERP.
• clicks: The number of clicks on a URL which took users to a page outside of the SERP.
• clickThrough: Calculated as the number of clicks divided by the number of impressions.
• position: The position of the URL within the SERP.
• date: The date of the search.
• citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
• index: The Elasticsearch index corresponding to page click data for a single IR.
• repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “page-clicks”. For example, the file named 2021-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2021.

### country-device-info:

The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

• country: The country from which the corresponding search originated.
• device: The device used for the search.
• impressions: The number of times the URL appears within the SERP.
• clicks: The number of clicks on a URL which took users to a page outside of the SERP.
• clickThrough: Calculated as the number of clicks divided by the number of impressions.
• position: The position of the URL within the SERP.
• date: The date of the search.
• index: The Elasticsearch index corresponding to country and device access information data for a single IR.
• repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent.
That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.

Filenames for files containing these data end with “country-device-info”. For example, the file named 2021-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2021.

## References

Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.
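## Example: Calculating CCD from Published CSV Data

The two CCD steps described earlier (filter rows where citableContent is "Yes", then sum clicks) can be applied to a page-clicks CSV file, grouping by repository_id as recommended above. The following is a minimal sketch, not RAMP's own code. The function name `calculate_ccd`, the sample rows, and the assumption that dates are stored in ISO format (YYYY-MM-DD) are illustrative only.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Inline sample standing in for a page-clicks CSV file. All values are made up,
# and the ISO date format is an assumption about the published files.
SAMPLE_CSV = """url,impressions,clicks,clickThrough,position,date,citableContent,index,repository_id
https://ir.example.edu/handle/1/123,100,4,0.04,8.2,2021-01-05,No,example_index_a,example_ir
https://ir.example.edu/bitstream/1/123/thesis.pdf,50,7,0.14,3.1,2021-01-05,Yes,example_index_a,example_ir
https://ir.example.edu/bitstream/1/456/data.csv,20,2,0.10,5.0,2021-01-20,Yes,example_index_b,example_ir
https://other.example.org/files/report.pdf,10,1,0.10,6.4,2021-02-01,Yes,other_index,other_ir
"""


def calculate_ccd(csv_file, start, end):
    """Sum clicks per repository_id over citable-content rows dated within [start, end]."""
    ccd = defaultdict(int)
    for row in csv.DictReader(csv_file):
        row_date = date.fromisoformat(row["date"])
        # Step 1: keep only citable-content rows in the requested date range.
        if row["citableContent"] == "Yes" and start <= row_date <= end:
            # Step 2: sum the clicks, keyed by the canonical repository_id.
            ccd[row["repository_id"]] += int(float(row["clicks"]))
    return dict(ccd)


ccd_january = calculate_ccd(io.StringIO(SAMPLE_CSV), date(2021, 1, 1), date(2021, 1, 31))
print(ccd_january)  # {'example_ir': 9}
```

Grouping on repository_id rather than index keeps the two sample rows for example_ir together even though they carry different index names, which mirrors the reason given above for preferring that field.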