Mining the first 100 days: Human and data ethics in Twitter research
Data files
Aug 09, 2021 version (1.37 GB total)
- README.md (8.85 KB)
- README.pdf (169.10 KB)
- search_maga_hashtag_tweet_ids_2017-01-23-2017-05-01.zip (8.24 MB)
- search_realdonaldtrump_mentions_tweet_ids_2016-11-28-2017-05-01.zip (30.42 MB)
- search_trump_string_tweet_ids_2017-01-18-2017-05-01_1.zip (31.40 MB)
- search_trump_string_tweet_ids_2017-01-18-2017-05-01_2.zip (31.75 MB)
- search_trump_string_tweet_ids_2017-01-18-2017-05-01_3.zip (6.76 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_1.zip (62.42 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_2.zip (62.32 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_3.zip (62.31 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_4.zip (62.51 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_5.zip (62.51 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_6.zip (62.56 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_7.zip (63.03 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_8.zip (62.94 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_9.zip (63.08 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_10.zip (62.61 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_11.zip (61.92 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_12.zip (61.89 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_13.zip (62.13 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_14.zip (62.58 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_15.zip (62.52 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_16.zip (62.70 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_17.zip (62.44 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_18.zip (62.31 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_19.zip (62.18 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_20.zip (62.42 MB)
- stream_tweet_ids_2017_01-26-2017-05-01_21.zip (9.05 MB)
- timeline_realdonaldtrump_tweet_ids_2016-12-21-2017-05-01.zip (8.55 KB)
- timeline_trumpregrets_tweet_ids_2017-01-15-2017-05-01.zip (15.64 KB)
Abstract
This dataset consists of tweet identifiers for tweets harvested from November 28, 2016, following the election of Donald Trump, through the end of the first 100 days of his administration; data collection ended May 1, 2017.
Tweets were harvested using multiple methods, described below. The full dataset comprises 218,273,152 tweets. Because different methods were used to harvest tweets, there may be some duplication across tweet sets.
Methods
Data were harvested from the Twitter API using the following endpoints:
- search
- timeline
- filter
Three tweet sets were harvested using the search endpoint, which returns tweets that include a specific search term, user mention, hashtag, etc. The table below provides the search term, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented.
Search term | Dates collected | Count tweets | Count unique users |
---|---|---|---|
@realDonaldTrump user mention | 2016-11-28 - 2017-05-01 | 4,597,326 | 1,501,806 |
"Trump" in tweet text | 2017-01-18 - 2017-05-01 | 11,055,772 | 2,648,849 |
#MAGA hashtag | 2017-01-23 - 2017-05-01 | 1,169,897 | 236,033 |
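For illustration only (this is not the original harvesting script), a search harvest of this kind might look roughly like the sketch below, using the Python `tweepy` library (version 3.x, cited in the references). The credentials, query, and item limit are placeholders.

```python
import tweepy

# Placeholder credentials; substitute real Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Page through the search endpoint for tweets containing the word "trump",
# collecting tweet identifiers and the set of unique authors.
tweet_ids = []
user_ids = set()
for status in tweepy.Cursor(api.search, q="trump").items(1000):  # placeholder limit
    tweet_ids.append(status.id_str)
    user_ids.add(status.user.id_str)

print(len(tweet_ids), "tweets from", len(user_ids), "unique users")
```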
Two tweet sets were harvested using the timeline endpoint, which returns tweets published by a specific user. The table below provides the user whose timeline was harvested, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented. Note that in these cases, each tweet set is necessarily limited to the single user whose timeline was harvested.
User | Dates collected | Count tweets | Count unique users |
---|---|---|---|
realDonaldTrump | 2016-12-21 - 2017-05-01 | 902 | 1 |
trumpRegrets | 2017-01-15 - 2017-05-01 | 1,751 | 1 |
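A timeline harvest is similar; the sketch below (again illustrative rather than the original script, with placeholder credentials) pages through a single user's timeline and collects tweet identifiers.

```python
import tweepy

# Placeholder credentials; substitute real Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Page through one user's timeline, 200 tweets per request, and collect IDs.
timeline_ids = [
    status.id_str
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name="realDonaldTrump",
                                count=200).items()
]
print(len(timeline_ids), "timeline tweets collected")
```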
The largest tweet set was harvested using the filter endpoint, which allows for streaming data access in near real time. Requests made to this API can be filtered to include tweets that meet specific criteria. The table below provides the filters used, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented.
Filtering via the API uses a default "OR," so the tweets included in this set satisfied any of the filter terms.
The script used to harvest streaming data from the filter API was built using the Python `tweepy` library.
Filter terms | Dates collected | Count tweets | Count unique users |
---|---|---|---|
tweets by realDonaldTrump; tweets mentioning @realDonaldTrump; 'maga' in text; 'trump' in text; 'potus' in text | 2017-01-26 - 2017-05-01 | 201,447,504 | 12,489,255 |
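The original streaming script is not reproduced here, but a minimal sketch of what a `tweepy` (3.x) filter stream of this shape looks like is shown below. The listener writes each incoming tweet to its own JSON file; the credentials, output filenames, and the numeric user ID passed to `follow` are placeholders.

```python
import json

import tweepy

# Placeholder credentials; substitute real Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")


class SaveTweetListener(tweepy.StreamListener):
    """Write each incoming tweet to its own JSON file, named by tweet ID."""

    def on_status(self, status):
        with open("{}.json".format(status.id_str), "w") as f:
            json.dump(status._json, f)

    def on_error(self, status_code):
        # HTTP 420 signals rate limiting on the streaming API; disconnect.
        if status_code == 420:
            return False


# Keyword filters are combined with a default OR; "follow" takes numeric
# user IDs as strings (the value below is a placeholder).
stream = tweepy.Stream(auth=auth, listener=SaveTweetListener())
stream.filter(track=["trump", "maga", "potus", "@realDonaldTrump"],
              follow=["000000000"])
```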
Harvested tweets, including all corresponding metadata, were stored in individual JSON files (one file per tweet).
Data Processing: Conversion to CSV format
Per the terms of Twitter's developer agreement and policy, tweet datasets may be shared for academic research use, but sharing is limited to tweet identifiers; the tweets themselves must be re-harvested, which accounts for any deletions and/or modifications of individual tweets since the original collection. It is not permitted to share the originally harvested tweets in JSON format.
Tweet identifiers have been extracted from the JSON data and saved as plain text CSV files. The CSV files all have a single column:
- id_str (string): A tweet identifier
The data include one tweet identifier per row.
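The identifier extraction step can be sketched in a few lines of Python; the paths below are placeholders, and the exact header handling is illustrative rather than a byte-for-byte description of the shared files.

```python
import csv
import glob
import json

# Placeholder paths: a directory of per-tweet JSON files and an output CSV.
json_dir = "harvested_tweets"
output_csv = "tweet_ids.csv"

with open(output_csv, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id_str"])  # single column holding tweet identifiers
    for path in glob.glob("{}/*.json".format(json_dir)):
        with open(path) as f:
            tweet = json.load(f)
        writer.writerow([tweet["id_str"]])
```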
Usage notes
Tweet identifiers are provided in multiple CSV files, grouped by the method used to harvest them, as described above. For four of the tweet sets, all corresponding tweet identifiers are included in a single file per tweet set:
- search_maga_hashtag_tweet_ids_2017-01-23-2017-05-01.csv: tweets with a #MAGA hashtag, returned using the search endpoint
- search_realdonaldtrump_mentions_tweet_ids_2016-11-28-2017-05-01.csv: tweets with an @realDonaldTrump mention, returned using the search endpoint
- timeline_realdonaldtrump_tweet_ids_2016-12-21-2017-05-01.csv: tweets posted by the user realDonaldTrump, returned using the timeline endpoint
- timeline_trumpregrets_tweet_ids_2017-01-15-2017-05-01.csv: tweets posted by the user trumpRegrets, returned using the timeline endpoint
Due to their size, tweet identifiers for the other two tweet sets were split into multiple CSV files. Tweet identifiers for the tweets harvested using the search API endpoint to search for the word "trump" in the text of the tweet were split into three CSV files, with a maximum of 5 million tweet identifiers per file:
- search_trump_string_tweet_ids_2017-01-18-2017-05-01_1.csv
- search_trump_string_tweet_ids_2017-01-18-2017-05-01_2.csv
- search_trump_string_tweet_ids_2017-01-18-2017-05-01_3.csv
Tweet identifiers for tweets harvested using the filter API endpoint were split into twenty-one files, with a maximum of 10 million tweet identifiers per file:
- stream_tweet_ids_2017_01-26-2017-05-01_1.csv
- stream_tweet_ids_2017_01-26-2017-05-01_2.csv
- stream_tweet_ids_2017_01-26-2017-05-01_3.csv
- stream_tweet_ids_2017_01-26-2017-05-01_4.csv
- stream_tweet_ids_2017_01-26-2017-05-01_5.csv
- stream_tweet_ids_2017_01-26-2017-05-01_6.csv
- stream_tweet_ids_2017_01-26-2017-05-01_7.csv
- stream_tweet_ids_2017_01-26-2017-05-01_8.csv
- stream_tweet_ids_2017_01-26-2017-05-01_9.csv
- stream_tweet_ids_2017_01-26-2017-05-01_10.csv
- stream_tweet_ids_2017_01-26-2017-05-01_11.csv
- stream_tweet_ids_2017_01-26-2017-05-01_12.csv
- stream_tweet_ids_2017_01-26-2017-05-01_13.csv
- stream_tweet_ids_2017_01-26-2017-05-01_14.csv
- stream_tweet_ids_2017_01-26-2017-05-01_15.csv
- stream_tweet_ids_2017_01-26-2017-05-01_16.csv
- stream_tweet_ids_2017_01-26-2017-05-01_17.csv
- stream_tweet_ids_2017_01-26-2017-05-01_18.csv
- stream_tweet_ids_2017_01-26-2017-05-01_19.csv
- stream_tweet_ids_2017_01-26-2017-05-01_20.csv
- stream_tweet_ids_2017_01-26-2017-05-01_21.csv
All CSV files have been compressed to zip format prior to upload. Each zip file contains one CSV file.
Tweets will have to be re-harvested using their identifiers. Multiple tools are available to help with this process, which is referred to as "rehydrating."
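As one illustrative approach (among the multiple tools available), identifiers can be rehydrated with `tweepy` 3.x by looking up statuses in batches of up to 100 IDs per request; the credentials and input file name below are placeholders.

```python
import csv

import tweepy

# Placeholder credentials; substitute real Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Read tweet identifiers from one of the extracted CSV files (placeholder name),
# dropping a header row if one is present.
with open("tweet_ids.csv") as f:
    rows = [row[0] for row in csv.reader(f) if row]
tweet_ids = [r for r in rows if r != "id_str"]

# statuses_lookup accepts at most 100 IDs per request; tweets that have been
# deleted or made private are silently omitted from the response.
tweets = []
for i in range(0, len(tweet_ids), 100):
    tweets.extend(api.statuses_lookup(tweet_ids[i:i + 100]))

print("Rehydrated", len(tweets), "of", len(tweet_ids), "tweets")
```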
References
1. Tweepy [computer software], v3.6.0. (2015). Retrieved from <https://github.com/tweepy/tweepy>
2. Twitter, Inc. (2017, November 3). Developer agreement and policy. Retrieved from <https://developer.twitter.com/en/developer-terms/agreement-and-policy>