Name: Screenshots and metadata for 214 reCAPTCHA challenges encountered between September 2022 - September 2023
Creator: Ben Pettis

In Chapter 3 of my dissertation (tentatively titled "Becoming Users: Layers of People, Technology, and Power on the Internet"), I describe how online user activities are datafied and monetized in subtle and often obfuscated ways. The chapter focuses on Google’s reCAPTCHA, a popular implementation of a CAPTCHA challenge. A CAPTCHA, or “Completely Automated Public Turing test to tell Computers and Humans Apart,” is a simple task or challenge intended to differentiate between genuine human users and those who may be using software or other automated means to interact maliciously with a website, such as for spam, mass data scraping, or denial of service attacks. reCAPTCHA challenges are increasingly hidden from the user’s direct view and instead assess mouse movements, browsing patterns, and other data to evaluate the likelihood that we are “authentic” users. These hidden challenges raise the stakes of understanding our own construction as Users because they obfuscate practices of surveillance and the ways that our activities as users are commodified by large corporations (Pettis, 2023). By studying the specifics of how such data collection works (that is, how we’re called upon and situated as Users), we can make more informed decisions about how we engage with the contemporary internet.

This data set contains metadata for the 214 reCAPTCHA elements that I encountered during my personal use of the Web over a period of one year (September 2022 through September 2023). Of these reCAPTCHAs, 137 were visible challenges, meaning that there was some indication of the presence of a reCAPTCHA challenge. The remaining 77 reCAPTCHAs were entirely hidden on the page. If I had not been running my browser extension, I would likely never have been aware of the use of a reCAPTCHA on the page. The data set also includes screenshots for 174 of the reCAPTCHAs. Screenshots that contain sensitive or private information have been excluded from public access. Researchers can request access to these additional files by contacting Ben Pettis <bpettis@wisc.edu>. A browsable and searchable version of the data is also available at https://capturingcaptcha.com

https://doi.org/10.5061/dryad.h70rxwdsr

Description of the data and file structure

Metadata about the reCAPTCHAs is all stored in a single JSON file in the "JSON Lines" format. This means that every line of the file contains a single record. For example:

{"_id":"b4f256f6-7503-42d9-9a18-b60c2331a7c6","status":3,"timestamp":{"$date":"2022-08-20T13:54:33.574Z"},"original_filename":"Screen Shot 2022-08-20 at 8.53.53 AM.png","new_filename":"b4f256f6-7503-42d9-9a18-b60c2331a7c6.png","privacy":true,"website_name":"Esurance","website_url":"esurance.com","website_type":"financial","website_type_other":"","visible":true,"challenge_description":"\"Protected by reCAPTCHA\" logo in the bottom corner","challenge_time":0,"challenge_attempts":0,"additional_description":"","accept_terms":true,"_keywords":["bottom","by","com","corner","esurance","financial","in","logo","protected","recaptcha","the"],"updated_at":{"$date":"2022-08-22T02:03:34.569Z"}}
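As a minimal sketch of how a file in this format can be read, the Python below parses one JSON object per line. The function name parse_json_lines and the two shortened sample lines are illustrative assumptions, not part of the data set.

```python
import json


def parse_json_lines(lines):
    """Yield one dict per non-empty line from an iterable of JSON Lines text."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {line_number} is not valid JSON") from error


# Two shortened, illustrative lines (not real records from the data set).
sample_lines = [
    '{"_id": "example-1", "visible": true}\n',
    '{"_id": "example-2", "visible": false}\n',
]
sample_records = list(parse_json_lines(sample_lines))
print(len(sample_records))  # 2
```

An open file handle is also an iterable of lines, so the same function should work on the real file, for example: with open("captcha-out.json", encoding="utf-8") as handle: records = list(parse_json_lines(handle)). The file name here is an assumption taken from the export command; substitute the name of the JSON file in your download.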

Each record contains several fields:

_id - unique identifier for the record. This can be appended to https://capturingcaptcha.com/submissions/ (e.g. https://capturingcaptcha.com/submissions/6957d3f5-3aa5-4eea-9475-8be4cee74574) to view online
status - a numeric code to represent the data processing status of the record
- 1 Submitted, Pending Review (No File Uploaded)
- 2 Submitted, Pending Review and File Scan
- 3 Accepted Submission and File
- 4 Accepted Submission (No File Uploaded)
- 5 Withheld (hide entire submission from public website)
- 6 Pending Deletion
- 7 Possible Spam
timestamp - when the reCAPTCHA was submitted
original_filename - name of the file that was uploaded via the extension
new_filename - newly generated name for the file before it is stored in Cloud Storage
privacy - whether the screenshot is hidden from public access
- true - hide screenshot from public access
- false - make screenshot public
website_name - human readable name of the source website
website_url - domain of the source website (e.g. esurance.com)
website_type - category of the website
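website_type_other - if "Other" category was selected, this field contains a written description
visible - true if there was some visible indication of the reCAPTCHA on the page, false if it was entirely hidden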
challenge_description - written description of the type of reCAPTCHA challenge
challenge_time - estimated number of seconds to complete the reCAPTCHA challenge
challenge_attempts - number of attempts to successfully complete the reCAPTCHA challenge
additional_description - written description/additional notes
accept_terms - indicates whether I checked the "accept terms" box before submitting my screenshot
_keywords - automatically generated keywords to improve website search functionality
updated_at - when the record was last updated
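The field descriptions above can be turned into a few small helpers. This is a sketch only: the helper names (status_label, submission_url, submitted_at) are hypothetical, and it assumes every timestamp uses the {"$date": "..."} wrapper with an ISO 8601 string, as in the example record. The sample record is a trimmed copy of that example.

```python
from datetime import datetime

# Status codes as listed in the field descriptions.
STATUS_LABELS = {
    1: "Submitted, Pending Review (No File Uploaded)",
    2: "Submitted, Pending Review and File Scan",
    3: "Accepted Submission and File",
    4: "Accepted Submission (No File Uploaded)",
    5: "Withheld (hide entire submission from public website)",
    6: "Pending Deletion",
    7: "Possible Spam",
}

SUBMISSION_BASE_URL = "https://capturingcaptcha.com/submissions/"


def status_label(record):
    """Translate the numeric status code into its written description."""
    return STATUS_LABELS.get(record["status"], "Unknown status")


def submission_url(record):
    """Build the public URL for a record by appending its _id."""
    return SUBMISSION_BASE_URL + record["_id"]


def submitted_at(record):
    """Parse the timestamp field (assumed to be {"$date": ISO 8601 string})."""
    raw = record["timestamp"]["$date"]
    # Swap the trailing "Z" for an explicit offset so this also runs on Python < 3.11.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


# Trimmed copy of the example record shown earlier in this README.
sample_record = {
    "_id": "b4f256f6-7503-42d9-9a18-b60c2331a7c6",
    "status": 3,
    "timestamp": {"$date": "2022-08-20T13:54:33.574Z"},
}

print(status_label(sample_record))  # Accepted Submission and File
print(submission_url(sample_record))
print(submitted_at(sample_record).isoformat())  # 2022-08-20T13:54:33.574000+00:00
```

Unknown status codes fall back to a placeholder label rather than raising an error, which seemed the safer default for exploratory analysis.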

In addition to the JSON file, there is a directory containing public screenshot files. These files are named with the same value as "new_filename" in each record.
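To pair a record with its screenshot, something like the following sketch may help. The function name screenshot_path and the directory name "screenshots" are assumptions, as is the rule that records with privacy set to true have no file in the public directory. Records without an uploaded file may also lack a usable new_filename, so the function returns None in both cases.

```python
from pathlib import Path


def screenshot_path(record, screenshot_dir):
    """Return the expected screenshot path for a record, or None if no public file is expected."""
    filename = record.get("new_filename")
    if not filename or record.get("privacy"):
        # No file was uploaded, or the screenshot is withheld from public access.
        return None
    return Path(screenshot_dir) / filename


# Illustrative records (not real entries from the data set).
public_record = {"new_filename": "example-1.png", "privacy": False}
private_record = {"new_filename": "example-2.png", "privacy": True}

print(screenshot_path(public_record, "screenshots"))
print(screenshot_path(private_record, "screenshots"))  # None
```

The function only builds the expected path; call .exists() on the result to confirm that the file is actually present in your copy of the data.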

Sharing/Access information

Link to other publicly accessible locations of the data:

https://capturingcaptcha.com

Code/Software

I created a browser extension to support internet researchers interested in user interactions with specific Web elements. The extension searches for a specified HTML element and, if it is detected, invites users to record their screens. It is adapted from development work for my own project on Google reCAPTCHA and users' interactions with these challenges. I've spun it off as a side project so that other researchers can use this approach in their own work. Its code is available at: https://github.com/bpettis/html-search-and-record

I developed a custom Google Chrome extension which detects when a page contains a reCAPTCHA and prompts the user to save a screenshot or screen recording while also collecting basic metadata. During Summer 2022, I began work on a website (https://capturingcaptcha.com) to collate and present the screen captures that I saved throughout the year. The purpose of collecting these examples of websites where reCAPTCHAs appear is to understand how this Web element is situated within websites and presented to users, and to sketch out how frequently it is used and on what kinds of websites. Because I only collected records of my own interactions with reCAPTCHAs, this is not a comprehensive sample that I can generalize as representative of all Web users. Though my experiences of the reCAPTCHA will differ from those of any other person, this collection is nevertheless useful for demonstrating how the interface element may be embedded within websites and presented to users. Following Niels Brügger’s descriptions of Web history methods, these screen capture techniques provide an effective way to preserve a portion of the Web as it was actually encountered by a person, as opposed to methods such as automated scraping. Therefore, my dissertation offers a methodological contribution to Web historians by demonstrating a technique for identifying and preserving a representation of one Web element within a page, as opposed to focusing an analysis on a whole page or entire website.

The browser extension is configured to store data in a cloud-based document database running in MongoDB Atlas. Any screenshots or video recordings are uploaded to a Google Cloud Storage bucket. Both the database and the cloud storage bucket are private and are restricted from direct access. The data and screenshots are viewable and searchable at https://capturingcaptcha.com. This data set represents an export of the database as of June 10, 2024. Data collection may resume after this date, in which case the website may display more records than this export contains.

The data was exported from the database to a single JSON file (JSON Lines format) using the mongoexport command line tool:

mongoexport --uri mongodb+srv://[database-url].mongodb.net/production --collection submissions --out captcha-out.json --username [databaseuser]
