Reddit blackout announcements: 2023 API protest
Data files: Feb 06, 2024 version (8.79 MB)
Abstract
Starting June 12, 2023, many Reddit communities (subreddits) "went dark" - changing to private mode - in protest of Reddit's plans to change its API access policies and fee structure. Supporters of the protest criticized the planned changes as prohibitively expensive for third-party apps. Beyond third-party apps, there is significant concern that the API changes are a move by the platform to increase monetization, degrade the user experience, and eventually kill off other custom features such as the old.reddit.com interface, the Reddit Enhancement Suite browser extension, and more. Additionally, there are concerns that the API changes will impede the ability of subreddit moderators (who are all unpaid users) to access tools that keep their communities on-topic and free of spam.
This dataset includes the "stickied" posts that appeared on 5,351 subreddits on June 11, 2023 and June 12, 2023 - including many subreddits announcing their plans to participate in the protest. These posts were scraped using a custom Python script that was written specifically for this purpose. Ironically, the script uses the PRAW (Python Reddit API Wrapper) library, requiring a valid Reddit API key. Accordingly, after the platform's new API pricing policy went into effect, it is no longer feasible for researchers to perform this type of web scraping without external funding support.
README: Reddit Blackout Announcements - 2023 API Protest
https://doi.org/10.5061/dryad.qfttdz0qd
This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.
Description of the data and file structure
The dataset is uploaded as a single .zip file. Once downloaded and decompressed, it includes several files and directories, organized as follows:
.
├── subreddit-list.txt
├── CSVs
│   ├── [subreddit-name].csv
│   └── [...]
├── code
│   └── [...]
└── parsed TXTs
    ├── API.txt
    ├── blackout.txt
    ├── community.txt
    ├── mod-team.txt
    ├── moderator.txt
    ├── platform.txt
    └── protest.txt
Subreddit List
The subreddit-list.txt file contains a list of 5,351 subreddit names. Each appears on its own line. This list was generated using the list-subreddits.py script, as described below.
Stickied Posts - CSVs
The "CSVs" directory contains 5,351 CSV (Comma Separated Value) files, each named with the subreddit that they were scraped from. The first row contains headers. Each of the following rows represents a single stickied post. Here is a sample from the /r/therewasanattempt subreddit
- id - the identifier of the thread, as created by Reddit
- created - the timestamp of when the thread was posted. This is a Unix timestamp, written in seconds past the epoch. Use a site such as https://www.epochconverter.com/ to convert to a human-readable time, or see the Python snippet after the sample row below.
- author - username of the account that posted the thread
- title - the title of the thread
- url - URL to access the thread
- text - the full text of the top post in the thread. This is written in Reddit Markdown. See: https://www.reddit.com/wiki/markdown/
id | created | author | title | url | text
---|---|---|---|---|---
1468896 | 1686423788 | PlenitudeOpulence | Therewasanattempt to use 3rd Party Apps on Reddit: This subreddit will be doing an "Attempt-out" starting June 12th | https://www.reddit.com/r/t | 
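As a minimal sketch, the created value from the sample row can also be converted with Python's standard library instead of an online converter:

# Convert the Unix timestamp from the sample row above to a
# human-readable UTC time (standard library only):
from datetime import datetime, timezone

created = 1686423788  # "created" value from the sample row
print(datetime.fromtimestamp(created, tz=timezone.utc).isoformat())
# 2023-06-10T19:03:08+00:00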
Code
The code directory contains a copy of the GitHub repository as it appeared on January 22, 2024. See below for information on running the scripts.
Parsed TXTs
This final directory contains a handful of plain text files, generated by parsing all of the CSV files to pull out occurrences of certain words. This was done in a Bash shell by printing the contents of every CSV file and using grep to filter for lines that include a specified word. For example:
cat *.csv | grep "blackout" >> blackout.txt
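For readers who prefer to stay in Python, here is a rough equivalent of the shell pipeline above (a sketch; the filenames are illustrative):

# Append every CSV line containing a keyword to a text file,
# mimicking: cat *.csv | grep "blackout" >> blackout.txt
import glob

keyword = "blackout"
with open("blackout.txt", "a", encoding="utf-8") as out:
    for path in glob.glob("*.csv"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if keyword in line:
                    out.write(line)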
Sharing/Access information
The raw data is also available directly from the Google Cloud Storage Bucket that the files were initially written into.
Links to other publicly accessible locations of the data:
- List of subreddits: https://storage.googleapis.com/reddit-blackout-announcements/subreddits.txt
- All stickied posts: https://storage.googleapis.com/reddit-blackout-announcements/
- Original GitHub repository: https://github.com/bpettis/reddit-blackout-announcements
Code/Software
Installation
I created these scripts for my own research purposes, so I can't necessarily guarantee that they will work in your environment. These notes are provided as reference, but I fully recognize that my setup may not reflect best practices.
Dependencies
Install packages:
pip install -r requirements.txt
Configuration
Google Cloud
The get-stickies.py script is set up to save the output data into a Google Cloud Storage bucket. This can be handy if you are planning on publicizing the data. You'll need to do a bit of setup:
- Create a Google Cloud project (or use an existing one)
- Create a Storage Bucket (with region/redundancy/privacy settings that fit your needs)
- Create a Service Account which has permission to write files into that bucket
- Download a key for that service account in JSON format
The information from the above steps should be placed in a .env file in the root directory:
GCS_BUCKET_NAME='gcs_bucket_name_here'
GCP_PROJECT='project_name_here'
GOOGLE_APPLICATION_CREDENTIALS='keys/path/to/gcs-credentials.json'
If you want to disable Google Cloud Storage, just change use_gcs = True to use_gcs = False toward the top of the script.
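As a rough illustration of what this configuration enables, here is a minimal sketch of loading the .env values and uploading a file to the bucket, assuming the python-dotenv and google-cloud-storage packages; the file paths are illustrative, not the exact logic of get-stickies.py:

import os
from dotenv import load_dotenv
from google.cloud import storage

# Reads GCS_BUCKET_NAME, GCP_PROJECT, GOOGLE_APPLICATION_CREDENTIALS from .env
load_dotenv()

client = storage.Client(project=os.environ["GCP_PROJECT"])
bucket = client.bucket(os.environ["GCS_BUCKET_NAME"])
# Upload a local CSV into the bucket (illustrative paths)
bucket.blob("csv/example.csv").upload_from_filename("output/example.csv")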
Cloud Logging
The get-stickies.py script can also log information via Google Cloud Logging. If you want to use this option, the service account that you use will need to have permission to write logs as well.
Set the name of the log by adding a line in the .env file in the root directory:
LOG_ID='reddit-blackout-announcements'
If you want to disable Cloud Logging, just change use_cloud_logging = True to use_cloud_logging = False toward the top of the script.
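For reference, a minimal sketch of writing a log entry with the google-cloud-logging client library; this is an assumption about the general approach, not the exact code in get-stickies.py:

import os
from google.cloud import logging as cloud_logging

# The client picks up credentials via GOOGLE_APPLICATION_CREDENTIALS
client = cloud_logging.Client()
logger = client.logger(os.environ.get("LOG_ID", "reddit-blackout-announcements"))
logger.log_text("Scrape started", severity="INFO")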
Google Cloud Authentication
You will need to save your service account key as a JSON file and place it somewhere the script can read it. Set this file path using the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Reddit Authentication
You'll need to have a Reddit account and generate Oauth2 credentials in order to authenticate to the Reddit API. Yes, this is a bit ironic given that this whole project is emerging in response to API changes.
Head to https://www.reddit.com/prefs/apps/ to create an app.
praw.ini
Instead of placing Reddit credentials directly in the script, I use an external praw.ini file to save configuration information. Put your credentials there, not in the script or the repository.
Example praw.ini:
[app_name]
client_id=CLIENT_ID_HERE
client_secret=CLIENT_SECRET_HERE
user_agent=app_name_here:v0.0.1 (by /u/username_here)
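With that file in place, PRAW can load the credentials by section name; for example (the section name "app_name" must match the header above):

import praw

# PRAW looks up the [app_name] section of praw.ini automatically
reddit = praw.Reddit("app_name")
print(reddit.read_only)  # True when no username/password is configured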
Usage
list-subreddits.py
This script looks at three Reddit posts and grabs the list of participating subreddits:
- https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/
It uses the requests library to get the HTTP response body. Then it uses re to search for links that look like <a href="/r/iphone/">r/iphone</a>, i.e., how the list appears in the post's HTML. Next it's just a bit of string cleanup and then writing to an output file.
This script does not use the Reddit API at all. It's just basic HTTP requests.
Set the location and name of the output file at the top of the script:

# Set location and name of output file here
output = 'output/subreddits.txt'
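The overall approach looks roughly like the following sketch; the regex, User-Agent string, and variable names are illustrative rather than copied from list-subreddits.py:

import re
import requests

output = 'output/subreddits.txt'
urls = [
    'https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/',
]

names = set()
for url in urls:
    # Plain HTTP request - no Reddit API involved
    body = requests.get(url, headers={'User-Agent': 'subreddit-list-scraper'}).text
    # Links in the post body look like <a href="/r/iphone/">r/iphone</a>
    names.update(re.findall(r'<a href="/r/([^"/]+)/">', body))

with open(output, 'w') as f:
    f.write('\n'.join(sorted(names)))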
get-stickies.py
This is a slightly modified version of the script that I had previously written to preserve other subreddit info: https://github.com/bpettis/reddit-scrape_mods-rules
It will create a CSV file for each of the listed subreddits. Each row of the CSV represents a stickied post. There currently isn't any logic to try and detect which post is the one announcing the blackout. I'm just saving all of them.
Set the list of input subreddits in input_list = 'output/subreddits.txt' at the top of the script. That file should be the one created by list-subreddits.py.
NOTE: This script only gets the stickied posts from each of the specified subreddits. If a subreddit doesn't have a sticky about its participation in the blackout, it won't be fully represented here. When a subreddit has no sticky at all, the script will hit a prawcore.exceptions.NotFound exception and continue on. This means you'll get some CSV files that look incomplete.
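The core loop looks roughly like the following sketch, including the NotFound handling described above (assumptions: the praw.ini section name and Reddit's usual two-sticky limit; this is not the exact code from get-stickies.py):

import praw
import prawcore

reddit = praw.Reddit("app_name")  # credentials loaded from praw.ini

def get_stickies(subreddit_name, max_stickies=2):
    """Return whatever stickied posts a subreddit has (usually 0-2)."""
    posts = []
    for i in range(1, max_stickies + 1):
        try:
            posts.append(reddit.subreddit(subreddit_name).sticky(number=i))
        except prawcore.exceptions.NotFound:
            break  # no sticky in this slot; move on to the next subreddit
    return posts

for post in get_stickies("therewasanattempt"):
    print(post.id, post.created_utc, post.author, post.title)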
Methods
The list of subreddits was created from the list of participating subreddits that had been collated in the /r/ModCoord subreddit. An initial Python script looks at three Reddit posts and grabs the list of participating subreddits:
- https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/
It uses the requests library to get the HTTP response body. Then it uses re to search for links that look like <a href="/r/iphone/">r/iphone</a>, i.e., how the list appears in the post. Next it's just a bit of string cleanup and then writing to an output file.
This script does not use the Reddit API at all. It's just basic HTTP requests.
A second Python script then reads that list and uses the Reddit API to request information about current posts in each subreddit. The script creates a CSV file for each of the listed subreddits and creates a new row for each "stickied" post. There currently isn't any logic to try and detect which post is the one announcing the blackout; I simply saved all of them. Many subreddits did not have any stickied posts at all, and many stickied posts were not related to the blackout.