Performance and social connection data for baseball and basketball from 2001 to 2020

Citation

Evans, Emily; Webb, Benjamin; Jones, Rebecca (2022), Performance and social connection data for baseball and basketball from 2001 to 2020, Dryad, Dataset, https://doi.org/10.5061/dryad.g4f4qrfs5

Abstract

We examine whether social data can be used to predict how members of Major League Baseball (MLB) and members of the National Basketball Association (NBA) transition between teams during their careers. We find that incorporating social data into various machine learning algorithms substantially improves the algorithms' ability to correctly determine these transitions in the NBA but only marginally in MLB. We also measure the extent to which player performance and team fitness data can be used to predict transitions between teams. This data, however, only slightly improves our predictions for both basketball and baseball players. We also consider whether social, performance, and team fitness data can be used to infer past transitions. Here we find that social data significantly improves our inference accuracy in both the NBA and MLB, but player performance and team fitness data again does little to improve this score.

Methods

Performance data was scraped from www.basketball-reference.com/leagues and www.baseball-reference.com/leagues using Python and Beautiful Soup, a Python package for extracting data from HTML documents. Because we looked at historical data and used appropriate crawl delays, we met the scraping terms defined in the robots.txt file for both sites. The Twitter data was collected using the Twitter API.
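The robots.txt check described above can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration, not the authors' scraper: the robots.txt text, the `example.com` URLs, and the helper names `build_parser`, `polite_delay`, and `allowed_urls` are all hypothetical, and the real rules on the two reference sites may differ.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration; the real
# files on the two reference sites may differ.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent="*", default=1.0):
    """Return the declared crawl delay in seconds, or a fallback if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(EXAMPLE_ROBOTS_TXT)
candidates = [
    "https://www.example.com/leagues/",
    "https://www.example.com/private/data.html",
]
to_fetch = allowed_urls(parser, candidates)
delay = polite_delay(parser)
```

A scraper built on this sketch would request only the URLs in `to_fetch` and pause for `delay` seconds (for example with `time.sleep(delay)`) between requests.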

First, we scraped the Twitter usernames for each player listed on www.baseball-reference.com/friv/baseball-player-twitter-accounts.shtml and www.basketball-reference.com/friv/twitter.html.
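As a rough sketch of this username-extraction step, the example below pulls handles out of links to twitter.com. To stay dependency-free it uses the standard-library `html.parser` in place of Beautiful Soup, so it is not the authors' code. The sample HTML, the assumption that each handle appears as a link to `twitter.com/<handle>`, and the names `TwitterHandleParser` and `extract_handles` are all hypothetical; the actual page layout may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TwitterHandleParser(HTMLParser):
    """Collect handles from <a href="https://twitter.com/<handle>"> links."""

    def __init__(self):
        super().__init__()
        self.handles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        # Assumption: handles are exposed as links to twitter.com/<handle>.
        if parsed.netloc.lower() in {"twitter.com", "www.twitter.com"}:
            handle = parsed.path.strip("/").split("/")[0]
            if handle and handle not in self.handles:
                self.handles.append(handle)


def extract_handles(html):
    """Return the unique Twitter handles linked from an HTML string, in order."""
    parser = TwitterHandleParser()
    parser.feed(html)
    return parser.handles


# Hypothetical page fragment, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td>Player One</td><td><a href="https://twitter.com/player_one">@player_one</a></td></tr>
  <tr><td>Player Two</td><td><a href="https://twitter.com/player_two/">@player_two</a></td></tr>
  <tr><td>Site link</td><td><a href="/leagues/">Leagues</a></td></tr>
</table>
"""

handles = extract_handles(SAMPLE_HTML)
```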

Then, using tweepy, a Python package for connecting to the Twitter API, we collected the Twitter IDs of the other MLB/NBA players that each player followed. We chose to look at the accounts each player follows ("followed") rather than the accounts following each player ("following") because it significantly sped up the data collection process. By using the Twitter API and tweepy, we were able to follow all necessary protocols, including respecting rate limits and accessing only publicly available information.
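The restriction to player-to-player connections can be sketched as a simple filter over each player's followed list. The function below is a minimal illustration under stated assumptions, not the authors' pipeline: `build_follow_edges`, `fake_fetch`, and the IDs are hypothetical, and the fetching step is injected as a callable so the example runs without API credentials.

```python
from typing import Callable, Iterable, List, Tuple


def build_follow_edges(
    player_ids: Iterable[int],
    fetch_followed_ids: Callable[[int], Iterable[int]],
) -> List[Tuple[int, int]]:
    """Return directed (follower, followed) pairs where both ends are players."""
    players = set(player_ids)
    edges = []
    for player in sorted(players):
        for followed in fetch_followed_ids(player):
            # Drop accounts outside the player set and any self-reference.
            if followed in players and followed != player:
                edges.append((player, followed))
    return edges


# Hypothetical followed lists standing in for API responses.
FAKE_FOLLOWED = {
    101: [102, 999],  # 999 is not a player, so it is filtered out
    102: [101, 103],
    103: [],
}


def fake_fetch(user_id):
    """Stand-in fetcher; a real one would call the Twitter API.

    Assumption: with tweepy 4.x this might wrap
    tweepy.Cursor(api.get_friend_ids, user_id=user_id).items(),
    which should be checked against the installed tweepy version.
    """
    return FAKE_FOLLOWED.get(user_id, [])


edges = build_follow_edges([101, 102, 103], fake_fetch)
```

Injecting the fetcher keeps rate-limit handling separate from the graph construction, so the same filter works with cached responses or a live client.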