Supersharers of fake news on Twitter
Data files
May 24, 2024 version files (2.98 MB):
- Archive20240523.zip
- README.md
Abstract
Governments may have the capacity to flood social media with fake news, but little is known about the use of flooding by ordinary voters. In this work, we identify 2107 registered US voters that account for 80% of the fake news shared on Twitter during the 2020 US presidential election by an entire panel of 664,391 voters. We find that supersharers are important members of the network, reaching a sizable 5.2% of registered voters on the platform. Supersharers have a significant overrepresentation of women, older adults, and registered Republicans. Supersharers' massive volume does not seem automated but is rather generated through manual and persistent retweeting. These findings highlight a vulnerability of social media for democracy, where a small group of people distort the political reality for many.
README: Supersharers of Fake News on Twitter
This repository contains data and code for replication of the results presented in the paper.
The folders are mostly organized by research questions as detailed below. Each folder contains the code and publicly available data necessary for the replication of results. Importantly, no individual-level data is provided as part of this repository. De-identified individual-level data can be obtained for IRB-approved uses under the terms and conditions specified in the paper. Once access is granted, the restricted-access data is expected to be located under ./restricted_data.
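Because several scripts depend on the restricted-access data, it can help to check for the folder before running them. The following is a minimal sketch, not part of the repository: the helper name `restricted_data_available` is hypothetical, and only the `./restricted_data` location comes from this README.

```python
from pathlib import Path

# Location named in this README; the helper below is a hypothetical convenience.
RESTRICTED_DIR = Path("./restricted_data")


def restricted_data_available(base: Path = RESTRICTED_DIR) -> bool:
    """Return True if the restricted-data folder exists and contains at least one entry."""
    return base.is_dir() and any(base.iterdir())


# Scripts that need individual-level data could branch on this check.
print(restricted_data_available())
```

When the check returns False, only the analyses that rely on the public aggregate files in this repository can be reproduced.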
The folders in this repository are the following:
Preprocessing
Code under the preprocessing folder contains the following:
- source classifier - the code used to train a classifier based on NewsGuard domain flags to match the source-level fake news definition used in Grinberg et al. 2019.
- political classifier - the code used to identify political tweets, including daily training of the models.
- sample creation - the code used to identify supersharers in the larger sample of active panelists as well as the reference groups.
RQ0 Prevalence
Data and replication code for Figure 1 in the main paper, as generated by 01_prevalence.Rmd. The data in this folder is the following:
fig1_panelA_data.csv
- date - value ranging from 2020-08-01 to 2020-11-30
- total_shared_pol_tweets - aggregate number of political news tweets shared by the panel.
- total_shared_fake_pol_tweets - aggregate number of political fake news tweets shared by the panel.
- ss_fake_pol_tweets - aggregate number of political fake news tweets shared by supersharers.
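As an illustration of how these columns relate, the sketch below computes the daily fraction of the panel's political fake news tweets that came from supersharers. It uses only the column names listed above; the function name and the sample rows are hypothetical, and the numbers are invented placeholders, not values from the dataset.

```python
import csv
import io

# Synthetic placeholder rows with the documented columns.
# The numbers are invented for illustration and are NOT taken from the dataset.
SAMPLE_CSV = """date,total_shared_pol_tweets,total_shared_fake_pol_tweets,ss_fake_pol_tweets
2020-08-01,1000,100,80
2020-08-02,1200,150,90
"""


def supersharer_share_by_day(csv_file):
    """Map each date to the fraction of fake political tweets shared by supersharers."""
    shares = {}
    for row in csv.DictReader(csv_file):
        fake_total = float(row["total_shared_fake_pol_tweets"])
        ss_fake = float(row["ss_fake_pol_tweets"])
        # Days with no fake news tweets have no defined share.
        shares[row["date"]] = ss_fake / fake_total if fake_total else None
    return shares


shares = supersharer_share_by_day(io.StringIO(SAMPLE_CSV))
for day, share in shares.items():
    print(day, share)
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example `open("fig1_panelA_data.csv", newline="")` from within the folder that holds it.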
Fig1_panelB_data.csv.gz
- cat - content category: political news or fake news.
- pct_ppl - percent of people who shared this percent of the aggregate content.
- pct_shared - percent of the aggregate amount of content shared.
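The panel B and panel C files are gzip-compressed. The sketch below shows one way to read such a file with the Python standard library and group the points by content category. The function name is hypothetical, and it assumes that `pct_ppl` and `pct_shared` are stored as plain numbers.

```python
import csv
import gzip
from collections import defaultdict


def load_concentration_curves(path):
    """Read a gzipped CSV and return {cat: [(pct_ppl, pct_shared), ...]} sorted by pct_ppl."""
    curves = defaultdict(list)
    # "rt" opens the gzip stream in text mode so the csv module can read it directly.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            point = (float(row["pct_ppl"]), float(row["pct_shared"]))
            curves[row["cat"]].append(point)
    return {cat: sorted(points) for cat, points in curves.items()}
```

A call such as `load_concentration_curves("Fig1_panelB_data.csv.gz")` would then return one sorted list of points per category, ready for plotting.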
Fig1_panelC_data.csv.gz
- bins - percentile.
- cat - group category: supersharers (ss-fn) or SS-NF (ss-pol).
- n - number of users in this bin.
- percent - percent of users in bin.
article_level_claims_validation.csv
- fake news source - yes/no indication of whether the link originated from a fake news source
- article link / claim - article URL or main claim.
- verifiably false - is the main article claim verifiably false.
- fact-checking - URL for fact-checked information.
- fact-checking additional - additional URL for fact-checked information.
RQ1
Code under the RQ1 folder contains the following:
- 01_reach.ipynb - Python (mostly pySpark) code for calculating various measures of supersharers' reach.
- 02_importance.Rmd - Code for replicating statistical testing of the calculated measures of network (topological) importance and engagement in RQ1. Requires access to the restricted-access data.
RQ2
Data and replication code for RQ2 including:
- 01_regressions.Rmd - Code for the regressions comparing supersharers to reference groups. Requires access to the restricted-access data.
- 02_fig2.Rmd - Code for producing Figure 2. Panels A, C, and D can be recreated without access to restricted-access data. Panel B, which shows the full age distribution across groups, requires restricted data.
- 03_figS5.Rmd - Code for producing Figure S5 in SM.13.
The data in this folder is the following:
Fig2_panelA_data.csv
- cat - group category: Supersharers, SS-NF, Panel, and AVG. fake.
- gender - Male, Female, or Unknown.
- prp - proportion of male, female, or unknown gender in the group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
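The proportion and confidence-interval columns can be compared across groups directly. The sketch below pulls the interval for one gender value per group and checks whether two intervals overlap. The function names and sample rows are hypothetical, the numbers are invented placeholders, and the exact spelling of the `cat` and `gender` values in the real file is an assumption. Non-overlapping intervals are only a rough visual heuristic and do not substitute for the regressions in 01_regressions.Rmd.

```python
import csv
import io

# Synthetic placeholder rows; values are invented and NOT taken from the dataset.
SAMPLE_CSV = """cat,gender,prp,prp_lo,prp_hi
Supersharers,Female,0.60,0.57,0.63
Panel,Female,0.50,0.49,0.51
"""


def intervals_by_group(csv_file, gender):
    """Return {cat: (prp_lo, prp, prp_hi)} for one gender value."""
    result = {}
    for row in csv.DictReader(csv_file):
        if row["gender"] == gender:
            result[row["cat"]] = (
                float(row["prp_lo"]),
                float(row["prp"]),
                float(row["prp_hi"]),
            )
    return result


def intervals_overlap(a, b):
    """True if two (lo, estimate, hi) intervals share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2]


groups = intervals_by_group(io.StringIO(SAMPLE_CSV), "Female")
overlap = intervals_overlap(groups["Supersharers"], groups["Panel"])
print(overlap)
```

The same two helpers would apply to the panel C file by filtering on `party_reg` instead of `gender`.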
Fig2_panelC_data.csv
- cat - group category: Supersharers, SS-NF, Panel, and AVG. fake.
- party_reg - party of registration: Democrat (Dem.), Republican (Rep.), or Independent (Ind.).
- prp - proportion of registered Democrats, Republicans, or Independents in the group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Fig2_panelD_data.csv
- cat - group category: Supersharers, SS-NF, Panel, and AVG. fake.
- prp - proportion of Caucasians in the group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Fig_S5_state_aggregated_percent.csv
- state_x - US state.
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), Panel (other), and AVG. fake (fake_sharer).
- percent - percent of panelists in state from this group.
RQ3
Data and replication code for RQ3 including:
- 01_temporal_analysis.ipynb - pySpark code for calculating time-of-day activity, intervals between consecutive posts, etc., to support the temporal analysis described in SM.6.
- 02_fig_S3.Rmd - Code for reproducing Figure S3.
The data in this folder is the following:
bot_manual_labeling.csv
This is the manual labeling of a sample of accounts as described in SM.6.
- bot_or_not - indication of the annotator label of the account as potentially bot or not.
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
Automation_sharing_hours.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- hour_of_the_day - hour of the day.
- count - aggregate number of tweets shared by the group in this hour.
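Since the groups differ greatly in size and volume, raw hourly counts are easier to compare after normalizing within each group. The sketch below converts counts into the share of each group's tweets posted in each hour. The function name and sample rows are hypothetical, the numbers are invented placeholders, and it assumes `hour_of_the_day` is stored as an integer.

```python
import csv
import io
from collections import defaultdict

# Synthetic placeholder rows; values are invented and NOT taken from the dataset.
SAMPLE_CSV = """cat,hour_of_the_day,count
ss-fn,0,10
ss-fn,1,30
other,0,5
other,1,5
"""


def hourly_proportions(csv_file):
    """Return {cat: {hour: share of that group's tweets posted in that hour}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        counts[row["cat"]][int(row["hour_of_the_day"])] = float(row["count"])
    proportions = {}
    for cat, by_hour in counts.items():
        total = sum(by_hour.values())
        # Leave a group empty if it has no tweets, to avoid dividing by zero.
        proportions[cat] = (
            {hour: c / total for hour, c in sorted(by_hour.items())} if total else {}
        )
    return proportions


props = hourly_proportions(io.StringIO(SAMPLE_CSV))
print(props)
```

Each group's proportions sum to 1, so the hourly profiles of supersharers and the reference groups can be drawn on a common scale.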
Automation_time_between_tweets.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- bins_cat - bin category: minutes or seconds.
- bins - bin of time between tweets.
- prp - proportion of tweets in bin in this group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Automation_session_length.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- bins - time bin of session lengths.
- prp - proportion of sessions in bin in this group.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Automation_sessions_count.csv
- cat - group category: Supersharers (ss-fn), SS-NF (ss-pol), or Panel (other).
- avg_sessions - average number of sessions per day.
- prp - percent of users in this group with this number of sessions per day.
- prp_lo - the above proportion lower 95% CI.
- prp_hi - the above proportion upper 95% CI.
Utils
The code used for plotting in R.
Methods
This dataset contains aggregated information necessary to replicate the results reported in our work on Supersharers of Fake News on Twitter while respecting and preserving the privacy expectations of individuals included in the analysis. No individual-level data is provided as part of this dataset.
The data collection process that enabled the creation of this dataset leveraged a large-scale panel of registered U.S. voters matched to Twitter accounts. We examined the activity of 664,391 panel members who were active on Twitter during the months of the 2020 U.S. presidential election (August to November 2020, inclusive), and identified a subset of 2,107 supersharers: the most prolific sharers of fake news in the panel, who together account for 80% of the fake news content shared on the platform by the panel. We rely on a source-level definition of fake news that uses the manually labeled list of fake news sites by Grinberg et al. 2019 and an updated list based on NewsGuard ratings (commercially available, but not provided as part of this dataset), although the results were robust to different operationalizations of fake news sources. We restrict the analysis to tweets with external links that were identified as political by a machine learning classifier that we trained and validated against human coders, similar to the approach used in prior work.
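The idea of selecting the most prolific sharers who together account for 80% of shared content can be sketched as a ranked cumulative sum. This is an illustrative sketch only, not the sample-creation code in the preprocessing folder: the function name, the tie handling, and the toy counts are assumptions, and the real procedure requires the restricted individual-level data.

```python
def top_sharers_covering(counts, threshold=0.8):
    """Return the smallest set of top sharers whose tweets reach `threshold` of the total.

    counts: dict mapping a user id to the number of fake news tweets shared.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    selected, running = [], 0
    # Rank users from most to least prolific, then accumulate until the threshold is met.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for user, n in ranked:
        if running >= threshold * total:
            break
        selected.append(user)
        running += n
    return selected


# Hypothetical toy counts; not real panel data.
toy_counts = {"u1": 50, "u2": 35, "u3": 10, "u4": 5}
supersharers = top_sharers_covering(toy_counts)
print(supersharers)
```

In this toy example the two most active users already cover 85 of 100 tweets, so only they are selected.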
We address our research questions by contrasting supersharers with three reference groups: people who are the most prolific sharers of non-fake political tweets (supersharers non-fake group; SS-NF), a group of average fake news sharers, and a random sample of panel members. In particular, we identify the distinct sociodemographic characteristics of supersharers using a series of multilevel regressions, examine their use of Twitter through existing tools and additional statistical analysis, and study supersharers' reach by examining the consumption patterns of voters that follow supersharers.