
Incentivizing news consumption on social media platforms using large language models and realistic bot accounts

Cite this dataset

Askari, Hadi et al. (2024). Incentivizing news consumption on social media platforms using large language models and realistic bot accounts [Dataset]. Dryad.


This project examines how to enhance users' exposure to and engagement with verified and ideologically balanced news in an ecologically valid setting. We rely on a large-scale, two-week field experiment on 28,457 Twitter users. We created 28 bots utilizing GPT-2 that replied to users tweeting about sports, entertainment, or lifestyle with a contextual reply containing two hardcoded elements: a URL to the topic-relevant section of a quality news organization and an encouragement to follow its Twitter account. Treated users were randomly assigned to receive responses from bots presented as female or male. We examine whether our intervention enhances the following of news media organizations, the sharing and liking of news content, and the tweeting and liking of political content. We find that treated users followed more news accounts and that users in the female bot treatment were more likely to like news content than those in the control group.

README: Incentivizing News Consumption on Social Media Platforms Using Large Language Models and Realistic Bot Accounts

Description of the data and file structure

The dataset contains the following CSV files:

UserInfo.csv - user account-level information

UserAnalysis_Data.csv - user behavioral data, with corresponding treatment labels

UserFollowingChange.csv - change in followed accounts, per user, pre and post experiment, for filtering

Final Coef Estimates.csv - Exported main model coefficients, for plotting figure 4

Final Coef Estimates PolInterest.csv - Exported political interest model coefficients, for plotting figure 5

Final Coef Estimates Class.csv - Exported user topic class model coefficients, for plotting appendix models

Final_topic_classification.csv - user topic classifications, for merging in topic class models.

The individual files contain the following columns:


UserInfo.csv:
user_id - Twitter ID of user
created_at - date of account creation
listed_count - number of lists the account features on
favourites_count - number of likes from account
statuses_count - number of posts from account
friends_count - number of accounts followed by user
followers_count - number of accounts following user


UserFollowingChange.csv:
original_user_id - Twitter ID of user
following - number of accounts followed pre-treatment
followingpost - number of accounts followed post-treatment
postdiff - difference between pre- and post-treatment accounts followed
postdiffpct - difference in pct. between pre- and post-treatment accounts followed
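
As a quick consistency check, the derived columns can be recomputed from the raw counts. The sketch below assumes that postdiff is followingpost minus following and that postdiffpct is that difference relative to the pre-treatment count, in percent. Neither convention is confirmed by this README, the helper name is hypothetical, and the sample rows are made up.

```python
import csv
import io

# Made-up sample rows in the UserFollowingChange.csv layout (not real data).
SAMPLE = """original_user_id,following,followingpost
101,200,210
102,50,45
103,0,3
"""

def following_change(row):
    """Return (postdiff, postdiffpct) under the assumed post-minus-pre convention."""
    pre = int(row["following"])
    post = int(row["followingpost"])
    diff = post - pre
    # Percent change is undefined when the pre-treatment count is zero.
    pct = 100.0 * diff / pre if pre else None
    return diff, pct

changes = {row["original_user_id"]: following_change(row)
           for row in csv.DictReader(io.StringIO(SAMPLE))}
```

If the recomputed values disagree in sign with the published columns, the file likely uses the opposite (pre minus post) convention.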


UserAnalysis_Data.csv:
UserIDs - Twitter ID of user
treatment - treatment group
treated - whether user was treated (not control)
followees_diff - post-treatment difference in number of media accounts followed
tweets_media_pct_diff - post-treatment difference in pct. retweets of media accounts
likes_media_pct_diff - post-treatment difference in pct. likes of media accounts
tweets_pol_pct_diff - post-treatment difference in pct. of political tweets
likes_pol_pct_diff - post-treatment difference in pct. of political likes
followees_pre - number of pre-treatment media accounts followed
tweets_media_pct_pre - pct. of pre-treatment retweets of media accounts
likes_media_pct_pre - pct. of pre-treatment likes of media accounts
tweets_pol_pct_pre - pct. of pre-treatment tweets that were political
likes_pol_pct_pre - pct. of pre-treatment likes that were political
followees_difftrim - post-treatment difference in number of media accounts followed (suggested accounts only)


Final_topic_classification.csv:

UserIDs - Twitter ID of user

Class - Topical classification of user (class of tweets most often sent by user)
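
Because the topic classifications are meant to be merged into the user-level data, a minimal sketch of that join on UserIDs may help. It uses only the Python standard library; the sample rows, the treatment labels, and the helper name are made up for illustration, and the authors' actual merge procedure is not described here.

```python
import csv
import io

# Made-up rows in the layouts described above (not real data);
# the treatment labels are hypothetical placeholders.
ANALYSIS = """UserIDs,treatment,followees_diff
101,female,2
102,control,0
103,male,1
"""
TOPICS = """UserIDs,Class
101,Sports
103,Lifestyle
"""

def merge_topic_class(analysis_rows, topic_rows):
    """Left-join the topic class onto the analysis rows by UserIDs."""
    class_by_user = {r["UserIDs"]: r["Class"] for r in topic_rows}
    merged = []
    for row in analysis_rows:
        out = dict(row)
        # Users without a classification get None instead of being dropped.
        out["Class"] = class_by_user.get(row["UserIDs"])
        merged.append(out)
    return merged

merged = merge_topic_class(
    csv.DictReader(io.StringIO(ANALYSIS)),
    csv.DictReader(io.StringIO(TOPICS)),
)
```

Reading the IDs as strings, as csv.DictReader does, avoids the precision loss that can occur when large 64-bit Twitter IDs are parsed as floating-point numbers.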

All of the coefficient CSV files contain these variables:

Treatment - Treatment Group

Variable - Variable modelled as dependent variable

coef - coefficient estimate

se - standard error

upper - upper bound (95% CI)

lower - lower bound (95% CI)

Model - ITT or Treated Models

UserType - One of Sports, Entertainment, or Lifestyle in 'Final Coef Estimates Class.csv', and either 'High Interest' (Political) or 'Low Interest' (Political) in 'Final Coef Estimates PolInterest.csv'.

Function - The "Score Test" performed in the regression models.
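
The interval columns can be sanity-checked against coef and se. The sketch below assumes a normal-approximation interval (coef plus or minus 1.96 times se); this README does not state how the bounds were computed, the function name is hypothetical, and the input numbers are made up.

```python
def ci95(coef, se, z=1.96):
    """Return (lower, upper) for a normal-approximation 95% interval."""
    return coef - z * se, coef + z * se

# Made-up estimate, not taken from the dataset.
lower, upper = ci95(0.10, 0.05)

# An interval that excludes zero corresponds to significance at the 5% level
# under this approximation.
excludes_zero = lower > 0 or upper < 0
```

Small discrepancies from the published bounds would be expected if the authors used a t-distribution or robust standard errors instead.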

Some cells in the files contain "N/A", which indicates that those cells were not applicable for that user or that information was missing for that user. 
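
When loading the files, the "N/A" marker needs to be mapped to a missing value before numeric conversion. A minimal sketch with the standard library follows; the sample rows and the helper name are made up for illustration.

```python
import csv
import io

# Made-up rows in the UserAnalysis_Data.csv layout (not real data).
SAMPLE = """UserIDs,likes_media_pct_diff
101,0.5
102,N/A
"""

def parse_cell(value):
    """Convert a numeric cell to float, mapping the "N/A" marker to None."""
    return None if value.strip() == "N/A" else float(value)

values = [parse_cell(r["likes_media_pct_diff"])
          for r in csv.DictReader(io.StringIO(SAMPLE))]
```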

Sharing/Access information

Data was derived from the following sources:

  • Twitter API and Tweepy



Data were collected via the Twitter API and the Python Tweepy library. The dataset contains raw files from our pre- and post-treatment metrics, as well as our final metrics after all classifications (politics and news).


European Research Council, Award: 756301