Supersharers of fake news on Twitter
Abstract
Governments may have the capacity to flood social media with fake news, but little is known about the use of flooding by ordinary voters. In this work, we identify 2107 registered US voters that account for 80% of the fake news shared on Twitter during the 2020 US presidential election by an entire panel of 664,391 voters. We find that supersharers are important members of the network, reaching a sizable 5.2% of registered voters on the platform. Supersharers have a significant overrepresentation of women, older adults, and registered Republicans. Supersharers' massive volume does not seem automated but is rather generated through manual and persistent retweeting. These findings highlight a vulnerability of social media for democracy, where a small group of people distort the political reality for many.
README: Supersharers of Fake News on Twitter
This repository contains data and code for replication of the results presented in the paper.
The folders are mostly organized by research question, as detailed below. Each folder contains the code and publicly available data necessary for the replication of results. Importantly, no individual-level data is provided as part of this repository. De-identified individual-level data can be obtained for IRB-approved uses under the terms and conditions specified in the paper. Once access is granted, the restricted-access data is expected to be located under ./restricted_data.
The folders in this repository are the following:
Preprocessing
Code under the preprocessing folder contains the following:
- source classifier - the code used to train a classifier based on NewsGuard domain flags to match the fake news source definition used in the Grinberg et al. 2019 labels.
- political classifier - the code used to identify political tweets, including daily training of the models.
- sample creation - the code used to identify supersharers in the larger sample of active panelists as well as the reference groups.
RQ0 Prevalence
Data and replication code for Figure 1 in the main paper, as generated by 01_prevalence.Rmd. The data files in this folder are the following:
fig1_panelA_data.csv
- date - value ranging from 2020-08-01 to 2020-11-30
- total_shared_pol_tweets - aggregate number of political news tweets shared by the panel.
- total_shared_fake_pol_tweets - aggregate number of political fake news tweets shared by the panel.
- ss_fake_pol_tweets - aggregate number of political fake news tweets shared by supersharers.
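For example, the daily prevalence of fake political news and the supersharers' contribution to it can be derived directly from this file. A minimal sketch, assuming the CSV header matches the field names above and the file is read from this folder:

```r
library(readr)
library(dplyr)

# Daily share of fake news among political news tweets, and the share of that
# fake news volume contributed by supersharers.
panel_a <- read_csv("fig1_panelA_data.csv") %>%
  mutate(
    pct_fake_of_political = 100 * total_shared_fake_pol_tweets / total_shared_pol_tweets,
    pct_fake_from_ss      = 100 * ss_fake_pol_tweets / total_shared_fake_pol_tweets
  )

head(panel_a)
```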
Fig1_panelB_data.csv.gz
- cat - content category: political news or fake news.
- pct_ppl - percent of people that account for the corresponding percent of the aggregate content shared.
- pct_shared - percent of the aggregate amount of content shared.
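This file encodes a concentration curve (what share of people accounts for what share of content). A minimal sketch for plotting it, assuming the header matches the field names above (read_csv reads the .gz file directly):

```r
library(readr)
library(ggplot2)

# Concentration curve: percent of people vs. percent of content they account for,
# drawn separately for political news and fake news.
panel_b <- read_csv("Fig1_panelB_data.csv.gz")

ggplot(panel_b, aes(x = pct_ppl, y = pct_shared, colour = cat)) +
  geom_line() +
  labs(x = "Percent of people", y = "Percent of content shared", colour = "Content category")
```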
Fig1_panelC_data.csv.gz
- bins - percentile.
- cat - group category: Supersharers (ss-fn) or SS-NF (ss-pol).
- n - number of users in this bin.
- percent - percent of users in this bin.
article_level_claims_validation.csv
- fake news source - yes/no indication of whether the link originated from a fake news source
- article link / claim - article URL or main claim.
- verifiably false - indication of whether the main article claim is verifiably false.
- fact-checking - URL for fact-checked information.
- fact-checking additional - additional URL for fact-checked information.
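These article-level labels can be cross-tabulated against the source-level label to gauge how often links from fake news sources carry verifiably false claims. A minimal sketch, assuming the CSV header matches the field names above:

```r
library(readr)
library(dplyr)

# Cross-tabulation of the source-level label against the article-level validation.
validation <- read_csv("article_level_claims_validation.csv")

validation %>%
  count(`fake news source`, `verifiably false`) %>%
  group_by(`fake news source`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```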
RQ1
Code under the RQ1 folder contains the following:
- 01_reach.ipynb - Python (mostly pySpark) code for calculating various measures of supersharers' reach.
- 02_importance.Rmd - Code for replicating the statistical testing of the calculated measures of network (topological) importance and engagement in RQ1. Requires access to the restricted-access data.
RQ2
Data and replication code for RQ2 including:
- 01_regressions.Rmd - Code for the regressions comparing supersharers to reference groups. Requires access to the restricted-access data.
- 02_fig2.Rmd - Code for producing Figure 2. Panels A, C, and D can be recreated without access to restricted-access data. Panel B, which shows the full age distribution across groups, requires restricted data.
- 03_figS5.Rmd - Code for producing Figure S5 in SM.13.
The data files in this folder are the following:
Fig2_panelA_data.csv
- cat - group category: Supersharers, SS-NF, Panel, and AVG. fake.
- gender - Male, Female, or Unknown.
- prp - proportion of male, female, or unknown gender in the group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Fig2_panelC_data.csv
- cat - group category: Supersharers, SS-NF, Panel, and AVG. fake.
- party_reg - party of registration: Democrat (Dem.), Republican (Rep.), or Independent (Ind.).
- prp - proportion of registered Democrats, Republicans, or Independents in the group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Fig2_panelD_data.csv
- cat - group category: Supersharers, SS-NF, Panel, and AVG. fake.
- prp - proportion of Caucasians in the group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
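The three Fig2_panel*_data.csv files above share the prp / prp_lo / prp_hi structure, so a single plotting pattern covers them; a minimal sketch for panel C (party of registration), assuming the header matches the field names above:

```r
library(readr)
library(ggplot2)

# Group proportions with 95% CIs, by party of registration.
panel_c <- read_csv("Fig2_panelC_data.csv")

ggplot(panel_c, aes(x = party_reg, y = prp, fill = cat)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = prp_lo, ymax = prp_hi),
                position = position_dodge(width = 0.9), width = 0.2) +
  labs(x = "Party of registration", y = "Proportion", fill = "Group")
```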
Fig_S5_state_aggregated_percent.csv
- state_x - US state.
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), Panel (other), and AVG. fake (fake_sharer).
- percent - percent of panelists in the state from this group.
RQ3
Data and replication code for RQ3 including:
- 01_temporal_analysis.ipynb - pySpark code for calculating various temporal measures (time of day, interval between subsequent posts, etc.) to facilitate the temporal analysis described in SM.6.
- 02_fig_S3.Rmd - Code for reproducing Figure S3.
The data files in this folder are the following:
bot_manual_labeling.csv
This is the manual labeling of a sample of accounts as described in SM.6.
- bot_or_not - indication of whether the annotator labeled the account as a potential bot.
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
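A minimal sketch for tabulating the annotator labels by group, assuming the CSV header matches the field names above (the coding of bot_or_not is left as found in the file):

```r
library(readr)
library(dplyr)

# Share of manually reviewed accounts per annotator label, within each group.
bot_labels <- read_csv("bot_manual_labeling.csv")

bot_labels %>%
  count(cat, bot_or_not) %>%
  group_by(cat) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```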
Automation_sharing_hours.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- hour_of_the_day - hour of the day.
- count - aggregate number of tweets shared by the group in this hour.
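A minimal sketch for comparing the hour-of-day sharing profiles across groups, normalising each group's counts so they sum to one (assuming the header matches the field names above):

```r
library(readr)
library(dplyr)
library(ggplot2)

# Hour-of-day sharing profile per group, normalised within group.
hours <- read_csv("Automation_sharing_hours.csv")

hours %>%
  group_by(cat) %>%
  mutate(share = count / sum(count)) %>%
  ungroup() %>%
  ggplot(aes(x = hour_of_the_day, y = share, colour = cat)) +
  geom_line() +
  labs(x = "Hour of the day", y = "Share of the group's tweets", colour = "Group")
```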
Automation_time_between_tweets.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- bins_cat - bin category: minutes or seconds.
- bins - bin of time between tweets.
- prp - proportion of tweets in bin in this group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Automation_session_length.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- bins - time bin of session lengths.
- prp - proportion of sessions in bin in this group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Automation_sessions_count.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- avg_sessions - average number of sessions per day.
- prp - percent of users in this group with this number of sessions per day.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
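The Automation_time_between_tweets.csv, Automation_session_length.csv, and Automation_sessions_count.csv files all carry prp with its 95% CI bounds, so the same plotting pattern applies; a minimal sketch for the sessions-per-day file, assuming the header matches the field names above:

```r
library(readr)
library(ggplot2)

# Distribution of sessions per day by group, with the provided 95% CIs.
sessions <- read_csv("Automation_sessions_count.csv")

ggplot(sessions, aes(x = avg_sessions, y = prp, colour = cat)) +
  geom_pointrange(aes(ymin = prp_lo, ymax = prp_hi)) +
  labs(x = "Sessions per day", y = "Percent of users", colour = "Group")
```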
Utils
The code used for plotting in R.
Methods
This dataset contains aggregated information necessary to replicate the results reported in our work on Supersharers of Fake News on Twitter while respecting and preserving the privacy expectations of individuals included in the analysis. No individual-level data is provided as part of this dataset.
The data collection process that enabled the creation of this dataset leveraged a large-scale panel of registered U.S. voters matched to Twitter accounts. We examined the activity of 664,391 panel members who were active on Twitter during the months of the 2020 U.S. presidential election (August to November 2020, inclusive), and identified a subset of 2,107 supersharers, the most prolific sharers of fake news in the panel, who together account for 80% of the fake news content shared on the platform. We rely on a source-level definition of fake news that uses the manually labeled list of fake news sites by Grinberg et al. 2019 and an updated list based on NewsGuard ratings (commercially available, but not provided as part of this dataset), although the results were robust to different operationalizations of fake news sources. We restrict the analysis to tweets with external links that were identified as political by a machine learning classifier that we trained and validated against human coders, similar to the approach used in prior work.
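As a rough, hypothetical illustration of the source-level labeling step (the actual pipeline lives under the preprocessing folder), the sketch below matches each shared link's host against a list of fake news source domains; the tweet_links and fake_news_domains objects and their columns are placeholders, not part of this dataset:

```r
library(dplyr)
library(urltools)

# Hypothetical sketch: flag links whose host appears on a list of fake news source
# domains. Inputs are placeholders: tweet_links (tweet_id, url) and
# fake_news_domains (a character vector such as c("example-fake-site.com")).
label_fake_news_links <- function(tweet_links, fake_news_domains) {
  tweet_links %>%
    mutate(
      host = tolower(domain(url)),      # e.g. "www.example-fake-site.com"
      host = sub("^www\\.", "", host)   # drop a leading "www."
    ) %>%
    mutate(is_fake_news_source = host %in% fake_news_domains)
}
```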
We address our research questions by contrasting supersharers with three reference groups: people who are the most prolific sharers of non-fake political tweets (supersharers non-fake group; SS-NF), a group of average fake news sharers, and a random sample of panel members. In particular, we identify the distinct sociodemographic characteristics of supersharers using a series of multilevel regressions, examine their use of Twitter through existing tools and additional statistical analysis, and study supersharers' reach by examining the consumption patterns of voters who follow supersharers.
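The exact regression specifications are given in the paper and implemented in RQ2/01_regressions.Rmd (restricted-access data required). Purely as a hypothetical illustration of a multilevel setup of this kind, with placeholder variable names:

```r
library(lme4)

# Hypothetical sketch: logistic multilevel model predicting supersharer status
# from sociodemographics, with a state-level random intercept. Variable names
# are placeholders and do not reflect the paper's exact specification.
fit <- glmer(
  is_supersharer ~ gender + age + party_reg + (1 | state),
  data   = restricted_individual_data,
  family = binomial()
)
summary(fit)
```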