
Mining the first 100 days: Human and data ethics in Twitter research

Citation

Wheeler, Jonathan; Neely, Teresa (2021), Mining the first 100 days: Human and data ethics in Twitter research, Dryad, Dataset, https://doi.org/10.5061/dryad.d2547d83h

Abstract

This dataset consists of identifiers for tweets harvested from November 28, 2016, shortly after the election of Donald Trump, through the end of the first 100 days of his administration. Data collection ended May 1, 2017.

Tweets were harvested using the multiple methods described below. The total dataset consists of 218,273,152 tweets. Because different methods were used to harvest tweets, the same tweet may appear in more than one set.

Methods

Data were harvested from the Twitter API using the following endpoints:

  • search
  • timeline
  • filter

Three tweet sets were harvested using the search endpoint, which returns tweets that include a specific search term, user mention, hashtag, etc.  The table below provides the search term, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented.

Search term                     Dates collected           Count tweets   Count unique users
@realDonaldTrump user mention   2016-11-28 - 2017-05-01   4,597,326      1,501,806
"Trump" in tweet text           2017-01-18 - 2017-05-01   11,055,772     2,648,849
#MAGA hashtag                   2017-01-23 - 2017-05-01   1,169,897      236,033
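The harvesting code itself is not part of this dataset. As an illustrative sketch only, a search harvest with the tweepy 3.x library named below might look like the following; the function names, lazy credential handling, and unique-user tally are assumptions, not the authors' code:

```python
def count_unique_users(user_ids):
    """Tally of distinct accounts represented in a harvested tweet set."""
    return len(set(user_ids))

def harvest_search(api, query):
    """Page through the (historical) v1.1 search endpoint for one term."""
    import tweepy  # imported lazily; the helper above needs no credentials

    # tweepy 3.x Cursor handles pagination across search result pages.
    for status in tweepy.Cursor(api.search, q=query).items():
        yield status.id_str, status.user.id_str
```

An authenticated `tweepy.API` instance would be passed in as `api`. The v1.1 search endpoint only returned tweets from roughly the previous week, which is one reason collection ran continuously over the period.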

Two tweet sets were harvested using the timeline endpoint, which returns tweets published by specific users. The table below provides the user whose timeline was harvested, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented. Note that in these cases, tweets were necessarily limited to the one unique user whose tweets were harvested.

User              Dates collected           Count tweets   Count unique users
realDonaldTrump   2016-12-21 - 2017-05-01   902            1
trumpRegrets      2017-01-15 - 2017-05-01   1,751          1
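A timeline harvest follows the same pattern. The sketch below is an assumption built on tweepy 3.x's `Cursor` paging, not the authors' script:

```python
def harvest_timeline(api, screen_name):
    """Yield tweet IDs from one user's timeline (historical v1.1 endpoint)."""
    import tweepy  # imported lazily; defining this needs no credentials

    # Cursor pages through user_timeline; the v1.1 endpoint could reach
    # back at most ~3,200 of an account's most recent tweets.
    for status in tweepy.Cursor(api.user_timeline,
                                screen_name=screen_name).items():
        yield status.id_str
```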

The largest tweet set was harvested using the filter endpoint, which allows for streaming data access in near real time. Requests made to this API can be filtered to include tweets that meet specific criteria. The table below provides the filters used, data collection dates, the total number of tweets in the corresponding tweet set, and the total number of unique Twitter users represented.

Filtering via the API uses a default "OR," so the tweets included in this set satisfied any of the filter terms.

The script used to harvest streaming data from the filter API was built using the Python `tweepy` library.

The filter terms were:

  • tweets by realDonaldTrump
  • tweet mentions of @realDonaldTrump
  • 'maga' in text
  • 'trump' in text
  • 'potus' in text

Dates collected           Count tweets   Count unique users
2017-01-26 - 2017-05-01   201,447,504    12,489,255
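The authors' streaming script is not included in the dataset, but a minimal tweepy 3.x reconstruction, assuming a one-file-per-tweet listener and illustrative credential parameters, could look like this:

```python
import json

# Track terms reconstructed from the filter list above; the listener class
# and file layout are assumptions, not the authors' script.
TRACK_TERMS = ["maga", "trump", "potus", "@realDonaldTrump"]

def run_stream(consumer_key, consumer_secret, access_token, access_secret,
               follow_ids):
    import tweepy  # tweepy 3.x; imported lazily

    class OneFilePerTweet(tweepy.StreamListener):
        def on_status(self, status):
            # One JSON file per tweet, as described under Methods.
            with open(status.id_str + ".json", "w") as f:
                json.dump(status._json, f)

        def on_error(self, status_code):
            # Returning False disconnects; 420 signaled rate limiting.
            return status_code != 420

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    stream = tweepy.Stream(auth=auth, listener=OneFilePerTweet())
    # follow= and track= conditions are OR'd together by the endpoint,
    # matching the default "OR" behavior noted above.
    stream.filter(follow=follow_ids, track=TRACK_TERMS)
```

`follow_ids` would hold the numeric account ID for realDonaldTrump, which the filter endpoint required in place of a screen name.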

Harvested tweets, including all corresponding metadata, were stored in individual JSON files (one file per tweet).

Data Processing: Conversion to CSV format

Per the terms of Twitter's developer agreement, tweet datasets may be shared for academic research use, but sharing is limited to tweet identifiers: tweets must be re-harvested from their identifiers to account for deletions and modifications of individual tweets. Sharing the originally harvested tweets in JSON format is not permitted.

Tweet identifiers have been extracted from the JSON data and saved as plain text CSV files. The CSV files all have a single column:

  • id_str (string): A tweet identifier

The data include one tweet identifier per row.
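A minimal sketch of this extraction step, assuming the per-tweet JSON files sit in a single directory (function and path names are illustrative, not the authors' code):

```python
import csv
import glob
import json
import os

def extract_ids(json_dir, csv_path):
    """Collect the id_str field from per-tweet JSON files into a
    one-column CSV, returning the number of identifiers written."""
    count = 0
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id_str"])  # single header column, as described
        for path in sorted(glob.glob(os.path.join(json_dir, "*.json"))):
            with open(path) as f:
                writer.writerow([json.load(f)["id_str"]])
            count += 1
    return count
```

Writing `id_str` as a string (rather than the numeric `id` field) avoids precision loss when the CSV is later opened by tools that parse large integers as floats.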

Usage Notes

Tweet identifiers are provided in multiple CSV files, grouped by the harvest method described above. For four of the tweet sets, all corresponding tweet identifiers are included in a single file per tweet set:

  • search_maga_hashtag_tweet_ids_2017-01-23-2017-05-01.csv: tweets with a #MAGA hashtag, returned using the search endpoint
  • search_realdonaldtrump_mentions_tweet_ids_2016-11-28-2017-05-01.csv: tweets with an @realDonaldTrump mention, returned using the search endpoint
  • timeline_realdonaldtrump_tweet_ids_2016-12-21-2017-05-01.csv: tweets posted by the user realDonaldTrump, returned using the timeline endpoint
  • timeline_trumpregrets_tweet_ids_2017-01-15-2017-05-01.csv: tweets posted by the user trumpRegrets, returned using the timeline endpoint

Due to their size, tweet identifiers for the other two tweet sets were split across multiple CSV files. Identifiers for tweets harvested using the search endpoint to match the word "trump" in tweet text were split into three CSV files, with a maximum of 5 million tweet identifiers per file:

  • search_trump_string_tweet_ids_2017-01-18-2017-05-01_1.csv
  • search_trump_string_tweet_ids_2017-01-18-2017-05-01_2.csv
  • search_trump_string_tweet_ids_2017-01-18-2017-05-01_3.csv

Tweet identifiers for tweets harvested using the filter API endpoint were split into twenty-one files, with a maximum of 10 million tweet identifiers per file:

  • stream_tweet_ids_2017_01-26-2017-05-01_1.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_2.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_3.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_4.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_5.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_6.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_7.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_8.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_9.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_10.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_11.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_12.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_13.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_14.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_15.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_16.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_17.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_18.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_19.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_20.csv
  • stream_tweet_ids_2017_01-26-2017-05-01_21.csv

All CSV files have been compressed to zip format prior to upload. Each zip file contains one CSV file.
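The splitting and zipping step can be sketched as follows; the helper name and argument layout are assumptions, and only the size caps and one-CSV-per-zip convention come from the text above:

```python
import os
import zipfile

def split_and_zip(ids, prefix, chunk_size, out_dir="."):
    """Write ids into numbered one-column CSVs of at most chunk_size rows,
    then compress each CSV into its own zip archive."""
    csv_names = []
    for n, start in enumerate(range(0, len(ids), chunk_size), start=1):
        name = "%s_%d.csv" % (prefix, n)
        path = os.path.join(out_dir, name)
        with open(path, "w") as f:
            f.write("id_str\n")
            f.writelines(i + "\n" for i in ids[start:start + chunk_size])
        with zipfile.ZipFile(path + ".zip", "w", zipfile.ZIP_DEFLATED) as z:
            z.write(path, arcname=name)  # one CSV per zip archive
        csv_names.append(name)
    return csv_names
```

With `chunk_size=10_000_000`, the 201,447,504 streamed identifiers would land in twenty-one files, matching the listing above.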

Tweets will have to be reharvested using their identifiers. Multiple tools are available to help with this process, which is referred to as "rehydrating."
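One such tool is twarc (`twarc hydrate`). With tweepy 3.x, the same could be done through the `statuses_lookup` method, which accepted up to 100 identifiers per request. A sketch under those assumptions (`batch` and `rehydrate` are illustrative names, not part of this dataset):

```python
def batch(ids, size=100):
    """Group tweet IDs into lookup-sized batches (the v1.1
    statuses/lookup endpoint accepted up to 100 IDs per request)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def rehydrate(api, ids):
    """Re-fetch full tweet objects from their identifiers (tweepy 3.x)."""
    for group in batch(ids):
        # Deleted, protected, or suspended tweets are silently dropped
        # from the results, so counts may be lower than the originals.
        for status in api.statuses_lookup(group):
            yield status._json
```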

References

1. Tweepy [computer software], v3.6.0. (2015). Retrieved from <https://github.com/tweepy/tweepy>
2. Twitter, Inc. (2017, November 3). Developer agreement and policy. Retrieved from <https://developer.twitter.com/en/developer-terms/agreement-and-policy>