Reddit blackout announcements: 2023 API protest
Data files: Feb 06, 2024 version (8.79 MB)
Abstract
Starting June 12, 2023, many Reddit communities (subreddits) "went dark" - changing to private mode - in protest of Reddit's plans to change its API access policies and fee structure. Supporters of the protest criticized the planned changes as prohibitively expensive for third-party apps. Beyond third-party apps, there is significant concern that the API changes are a move by the platform to increase monetization, degrade the user experience, and eventually kill off other custom features such as the old.reddit.com interface, the Reddit Enhancement Suite browser extension, and more. Additionally, there are concerns that the API changes will impede the ability of subreddit moderators (who are all unpaid users) to access tools that keep their communities on-topic and free of spam.
This dataset includes the "stickied" posts that appeared on 5,351 subreddits on June 11, 2023 and June 12, 2023 - including many subreddits announcing their plans to participate in the protest. These posts were scraped using a custom Python script that was written specifically for this purpose. Ironically, the script uses the PRAW (Python Reddit API Wrapper) library, requiring a valid Reddit API key. Accordingly, after the platform's new API pricing policy went into effect, it is no longer feasible for researchers to perform this type of web scraping without external funding support.
README: Reddit Blackout Announcements - 2023 API Protest
https://doi.org/10.5061/dryad.qfttdz0qd
This dataset includes the list of scraped subreddits, a single CSV file for each subreddit, and a copy of the Python scripts used to scrape the data.
Description of the data and file structure
The dataset is uploaded as a single .zip file. Once downloaded and decompressed, it includes several files and directories, organized as follows:
.
├── subreddit-list.txt
├── CSVs
│   ├── [subreddit-name].csv
│   └── [...]
├── code
│   └── [...]
└── parsed TXTs
    ├── API.txt
    ├── blackout.txt
    ├── community.txt
    ├── mod-team.txt
    ├── moderator.txt
    ├── platform.txt
    └── protest.txt
Subreddit List
The subreddit-list.txt file contains a list of 5,351 subreddit names. Each appears on its own line. This list was generated using the list-subreddits.py script, as described below.
Stickied Posts - CSVs
The "CSVs" directory contains 5,351 CSV (Comma Separated Value) files, each named with the subreddit that they were scraped from. The first row contains headers. Each of the following rows represents a single stickied post. Here is a sample from the /r/therewasanattempt subreddit
- id - the identifier of the thread, as created by Reddit
- created - the timestamp of when the thread was posted. This is a Unix timestamp, written in seconds past the epoch. Use a site such as https://www.epochconverter.com/ to convert to a human-readable time, or see the Python snippet after the sample row below.
- author - username of the account that posted the thread
- title - the title of the thread
- url - URL to access the thread
- text - the full text of the top post in the thread. This is written in Reddit Markdown. See: https://www.reddit.com/wiki/markdown/
id | created | author | title | url | text
---|---|---|---|---|---
1468896 | 1686423788 | PlenitudeOpulence | Therewasanattempt to use 3rd Party Apps on Reddit: This subreddit will be doing an "Attempt-out" starting June 12th | https://www.reddit.com/r/t | 
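As a minimal sketch, the created value from the sample row can also be converted with Python's standard library instead of an online converter:

# Convert the Unix timestamp from the sample row above to a
# human-readable UTC time (standard library only):
from datetime import datetime, timezone

created = 1686423788  # "created" value from the sample row
print(datetime.fromtimestamp(created, tz=timezone.utc).isoformat())
# 2023-06-10T19:03:08+00:00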
Code
The code directory contains a copy of the GitHub repository as it appeared on January 22, 2024. See below for information on running the scripts.
Parsed TXTs
This final directory contains a handful of plain text files, generated by parsing all of the CSV files to pull out occurrences of certain words. This was done in a Bash shell by printing the contents of every CSV file and using grep to filter for lines that include a specified word. For example:
cat *.csv | grep "blackout" >> blackout.txt
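For readers who prefer to stay in Python, here is a rough equivalent of the shell pipeline above (a sketch; the filenames are illustrative):

# Append every CSV line containing a keyword to a text file,
# mimicking: cat *.csv | grep "blackout" >> blackout.txt
import glob

keyword = "blackout"
with open("blackout.txt", "a", encoding="utf-8") as out:
    for path in glob.glob("*.csv"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if keyword in line:
                    out.write(line)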
Sharing/Access information
The raw data is also available directly from the Google Cloud Storage Bucket that the files were initially written into.
Links to other publicly accessible locations of the data:
- List of subreddits: https://storage.googleapis.com/reddit-blackout-announcements/subreddits.txt
- All stickied posts: https://storage.googleapis.com/reddit-blackout-announcements/
- Original GitHub repository: https://github.com/bpettis/reddit-blackout-announcements
Code/Software
Installation
I created these scripts for my own research purposes, so I can't necessarily guarantee that they will work in your environment. These notes are provided as reference, but I fully recognize that my setup may not reflect best practices.
Dependencies
Install packages:
pip install -r requirements.txt
Configuration
Google Cloud
The get-stickies.py script is set up to save the output data into a Google Cloud Storage bucket. This can be handy if you are planning on publicizing the data. You'll need to do a bit of setup:
- Create a Google Cloud project (or use an existing one)
- Create a Storage Bucket (with region/redundancy/privacy settings that fit your needs)
- Create a Service Account which has permission to write files into that bucket
- Download a key for that service account in JSON format
The information from the above steps should be placed in a .env file in the root directory:
GCS_BUCKET_NAME='gcs_bucket_name_here'
GCP_PROJECT='project_name_here'
GOOGLE_APPLICATION_CREDENTIALS='keys/path/to/gcs-credentials.json'
If you want to disable Google Cloud Storage, just change use_gcs = True to use_gcs = False toward the top of the script.
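As a rough illustration of what this configuration enables, here is a minimal sketch of loading the .env values and uploading a file to the bucket, assuming the python-dotenv and google-cloud-storage packages; the file paths are illustrative, not the exact logic of get-stickies.py:

import os
from dotenv import load_dotenv
from google.cloud import storage

# Reads GCS_BUCKET_NAME, GCP_PROJECT, GOOGLE_APPLICATION_CREDENTIALS from .env
load_dotenv()

client = storage.Client(project=os.environ["GCP_PROJECT"])
bucket = client.bucket(os.environ["GCS_BUCKET_NAME"])
# Upload a local CSV into the bucket (illustrative paths)
bucket.blob("csv/example.csv").upload_from_filename("output/example.csv")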
Cloud Logging
The get-stickies.py script can also log information via Google Cloud Logging. If you want to use this option, the service account that you use will need to have permission to write logs as well.
Set the name of the log by adding a line in the .env file in the root directory:
LOG_ID='reddit-blackout-announcements'
If you want to disable Cloud Logging, just change use_cloud_logging = True to use_cloud_logging = False toward the top of the script.
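For reference, a minimal sketch of writing a log entry with the google-cloud-logging client library; this is an assumption about the general approach, not the exact code in get-stickies.py:

import os
from google.cloud import logging as cloud_logging

# The client picks up credentials via GOOGLE_APPLICATION_CREDENTIALS
client = cloud_logging.Client()
logger = client.logger(os.environ.get("LOG_ID", "reddit-blackout-announcements"))
logger.log_text("Scrape started", severity="INFO")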
Google Cloud Authentication
You will need to save your service account key as a JSON file and place it somewhere the script can read it. Set this file path using the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Reddit Authentication
You'll need to have a Reddit account and generate Oauth2 credentials in order to authenticate to the Reddit API. Yes, this is a bit ironic given that this whole project is emerging in response to API changes.
Head to https://www.reddit.com/prefs/apps/ to create an app.
praw.ini
Instead of placing Reddit credentials directly in the script, I use an external praw.ini file to save configuration information. Put your credentials there, not in the script or the repository.
Example praw.ini:
[app_name]
client_id=CLIENT_ID_HERE
client_secret=CLIENT_SECRET_HERE
user_agent=app_name_here:v0.0.1 (by /u/username_here)
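With that file in place, PRAW can load the credentials by section name; for example (the section name "app_name" must match the header above):

import praw

# PRAW looks up the [app_name] section of praw.ini automatically
reddit = praw.Reddit("app_name")
print(reddit.read_only)  # True when no username/password is configured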
Usage
list-subreddits.py
This script looks at three Reddit posts and grabs the list of participating subreddits:
- https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/
It uses the requests library to get the HTTP response body. Then it uses re to search for links that look like <a href="/r/iphone/">r/iphone</a>, i.e., how the list appears in the post's HTML. Next it's just a bit of string cleanup and then writing to an output file.
This script does not use the Reddit API at all. It's just basic HTTP requests.
Set the location and name of the output file at the top of the script:

# Set location and name of output file here
output = 'output/subreddits.txt'
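The overall approach looks roughly like the following sketch; the regex, User-Agent string, and variable names are illustrative rather than copied from list-subreddits.py:

import re
import requests

output = 'output/subreddits.txt'
urls = [
    'https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/',
]

names = set()
for url in urls:
    # Plain HTTP request - no Reddit API involved
    body = requests.get(url, headers={'User-Agent': 'subreddit-list-scraper'}).text
    # Links in the post body look like <a href="/r/iphone/">r/iphone</a>
    names.update(re.findall(r'<a href="/r/([^"/]+)/">', body))

with open(output, 'w') as f:
    f.write('\n'.join(sorted(names)))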
get-stickies.py
This is a slightly modified version of the script that I had previously written to preserve other subreddit info: https://github.com/bpettis/reddit-scrape_mods-rules
It will create a CSV file for each of the listed subreddits. Each row of the CSV represents a stickied post. There currently isn't any logic to try and detect which post is the one announcing the blackout. I'm just saving all of them.
Set the list of input subreddits in input_list = 'output/subreddits.txt' at the top of the script. That file should be the one created by list-subreddits.py.
NOTE: This script only gets the stickied posts from each of the specified subreddits. If a subreddit doesn't have a sticky about its participation in the blackout, it won't be fully represented here. When a subreddit has no sticky at all, the script will hit a prawcore.exceptions.NotFound exception and continue on. This means you'll get some CSV files that look incomplete.
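The core loop looks roughly like the following sketch, including the NotFound handling described above (assumptions: the praw.ini section name and Reddit's usual two-sticky limit; this is not the exact code from get-stickies.py):

import praw
import prawcore

reddit = praw.Reddit("app_name")  # credentials loaded from praw.ini

def get_stickies(subreddit_name, max_stickies=2):
    """Return whatever stickied posts a subreddit has (usually 0-2)."""
    posts = []
    for i in range(1, max_stickies + 1):
        try:
            posts.append(reddit.subreddit(subreddit_name).sticky(number=i))
        except prawcore.exceptions.NotFound:
            break  # no sticky in this slot; move on to the next subreddit
    return posts

for post in get_stickies("therewasanattempt"):
    print(post.id, post.created_utc, post.author, post.title)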
Methods
The list of subreddits was created from the list of participating subreddits that had been collated in the /r/ModCoord subreddit. An initial Python script looks at three Reddit posts and grabs the list of participating subreddits:
- https://www.reddit.com/r/ModCoord/comments/1401qw5/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/143fzf6/incomplete_and_growing_list_of_participating/
- https://www.reddit.com/r/ModCoord/comments/146ffpb/incomplete_and_growing_list_of_participating/
It uses the requests library to get the HTTP response body. Then it uses re to search for links that look like <a href="/r/iphone/">r/iphone</a>, i.e., how the list appears in the post. Next it's just a bit of string cleanup and then writing to an output file.
This script does not use the Reddit API at all. It's just basic HTTP requests.
A second Python script then reads that list and uses the Reddit API to request information about current posts in each subreddit. The script creates a CSV file for each of the listed subreddits and creates a new row for each "stickied" post. There currently isn't any logic to try and detect which post is the one announcing the blackout; I simply saved all of them. Many subreddits did not have any stickied posts at all, and many stickied posts were not related to the blackout.