Development of public dynamic spatio-temporal monitoring and analysis tool of supply chain vulnerability, resilience, and sustainability
Data files
Jul 13, 2024 version files (71.41 MB total):
- bigrams_tf_idf.csv (10.77 MB)
- bigrams.csv (6.74 MB)
- cities.csv (228.23 KB)
- countries.csv (550.49 KB)
- README.md (3.93 KB)
- sentiment.csv (111.17 KB)
- states.csv (349.22 KB)
- topics.csv (1.18 MB)
- unigrams_tf_idf.csv (33.29 MB)
- unigrams.csv (18.19 MB)
Abstract
Supply chains play a pivotal role in driving economic growth and societal well-being, facilitating the efficient movement of goods from producers to consumers. However, the increasing frequency of disruptions caused by geopolitical events, pandemics, natural disasters, and shifts in commerce poses significant challenges to supply chain resilience. This draft update report discusses the development of a dynamic spatio-temporal monitoring and analysis tool to assess supply chain vulnerability, resilience, and sustainability. Leveraging news data, macroeconomic metrics, inbound cargo data (for sectors in California), and operational conditions of California’s highways, the tool employs Natural Language Processing (NLP) and empirical regression analyses to identify emerging trends and extract valuable information about disruptions to inform decision-making. Key features of the tool include sentiment analysis of news articles, topic classification, visualization of geographic locations, and tracking of macroeconomic indicators. By integrating diverse and dynamic data sources (e.g., news articles) and using empirical and analytical techniques, the tool offers a comprehensive framework to enhance our understanding of supply chain vulnerabilities and resilience, ultimately contributing to more effective strategies for decision-making in supply chain management. The dynamic nature of this tool enables continuous monitoring and adaptation to evolving conditions, thereby enhancing the analysis of resilience and sustainability in global supply chains.
https://doi.org/10.5061/dryad.qjq2bvqqj
This dataset presents the key features extracted from supply-chain-related news articles. The news articles are gathered based on the following query: (USA or United States) and (supply chain or supply-chain) and (disruption or resilience) and (retailer or warehouse or transportation or factory). The features are extracted using Natural Language Processing (NLP) techniques and include:
- Term frequency and TF-IDF. Term frequencies and Term Frequency-Inverse Document Frequency (TF-IDF) scores are calculated at the unigram and bigram levels. TF-IDF is a widely used metric for measuring the relevance of words within documents; tokens with higher TF-IDF values are considered more representative of a document's content (a short sketch follows this list).
- Topic share. News articles are classified into eight topics relevant to supply chain risks: political, environmental, financial, supply and demand, logistics, system, infrastructure, and sector. Topic share measures the relative frequency with which tokens associated with each topic appear within the content of a news article.
- Sentiment score. A sentiment score is assigned to each news article based on the frequency of positive or negative words, weighted according to their sentiment scale. Sentiment scores for all articles are standardized using a Z-score to ensure better comparability across documents within the corpus.
- Geographical location. News articles are categorized based on geographical locations mentioned within their content. The classification of locations is conducted at three hierarchical levels: countries, states within the U.S., and cities/municipalities within California.
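As an illustration of the unigram/bigram TF-IDF computation described above, here is a minimal sketch using scikit-learn's TfidfVectorizer; the two article texts are hypothetical placeholders, and the team's actual tokenization and preprocessing choices may differ:

```python
# Minimal sketch of unigram/bigram TF-IDF extraction with scikit-learn.
# The two articles are hypothetical placeholders for the daily news corpus.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "Port congestion disrupts the retail supply chain",
    "Warehouse automation improves supply chain resilience",
]

# ngram_range=(1, 2) produces unigrams and bigrams in a single pass.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(articles)

# One row per article, one column per unigram or bigram.
scores = pd.DataFrame(
    tfidf.toarray(), columns=vectorizer.get_feature_names_out()
)
print(scores.round(3))
```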
Description of the data and file structure
The dataset is provided at a daily resolution. Each data file includes a column named ‘published_at’, representing the publication date of the gathered news articles in Unix epoch time format. The data files contain the following variables (a short loading example follows this list):
- Unigrams, Bigrams. These files include three variables: ‘published_at’; ‘word’ (for unigrams) or ‘bigram’ (for bigrams); and ‘Freq’, the frequency with which the specific word or bigram was used within the articles published on a specific date.
- Unigrams_tf_idf, Bigrams_tf_idf. These files contain the same variables as the Unigrams and Bigrams files, except that ‘Freq’ is replaced by ‘tf-idf’, representing the relative importance of each word or bigram across the articles published on that date. The higher the ‘tf-idf’ value, the more relevant the unigram or bigram.
- Topics. This file includes three variables: ‘published_at’, ‘topics’ (representing the 8 topics into which the news is classified), and ‘Share’. A news article typically addresses multiple topics in different proportions. ‘Share’ indicates the proportion of each topic mentioned in the news articles published on a specific date. This share is calculated based on the relative frequency with which tokens associated with each topic appear within the content of a news article, allowing for an assessment of the prominence of specific topics across the corpus.
- Sentiment-score. This file contains two variables: ‘published_at’ and ‘sentiment-score’, which corresponds to the normalized sentiment of the news articles published on a specific date.
- Countries, states, cities. These files include three variables: ‘published_at’, ‘location’, and ‘Freq’. ‘Location’ identifies the country (ISO alpha-2 code), U.S. state (2-digit code), or California city, while ‘Freq’ indicates the frequency with which each location is mentioned in the news articles published on a specific date.
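As a usage illustration, the following is a minimal pandas sketch for loading one of the daily files and converting ‘published_at’ from Unix epoch time to calendar dates; the epoch unit (seconds) is an assumption, and the word ‘disruption’ is a hypothetical query term:

```python
# Minimal sketch: load a daily file and convert the 'published_at'
# Unix epoch timestamps to calendar dates. The unit ("s") is an assumption.
import pandas as pd

unigrams = pd.read_csv("unigrams.csv")
unigrams["published_at"] = pd.to_datetime(unigrams["published_at"], unit="s")

# Example: daily frequency of a hypothetical word of interest.
sub = unigrams[unigrams["word"] == "disruption"]
daily = sub.groupby(sub["published_at"].dt.date)["Freq"].sum()
print(daily.head())
```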
The research team implemented a two-stage procedure to streamline the collection, processing, and analysis of news data. The stages are as follows:
- Lexicon setup: This stage establishes the lexicons required for sentiment and topic analysis. Topics are categorized into eight groups relevant to supply chain risks: political, environmental, financial, supply and demand, logistics, system, infrastructure, and sector. Sentiments are evaluated using a dictionary-based approach with the AFINN lexicon. Three comprehensive lists of countries, states, and cities are used to classify geographical locations at three hierarchical levels: countries, states within the U.S., and cities/municipalities within California.
- News collection and processing: Automated algorithms collect the most recent news daily, with a 24-hour lag, based on the predefined query: (USA or United States) and (supply chain or supply-chain) and (disruption or resilience) and (retailer or warehouse or transportation or factory). Text mining tasks are performed to extract key performance metrics, including n-grams, topics, sentiments, and geographical locations. The process involves several steps (illustrated in the sketch following this list):
- Corpus setup,
- Term Frequency-Inverse Document Frequency (TF-IDF) for measuring word relevance in documents,
- Entity recognition and consolidation,
- Conversion of the corpus into a Document-Feature Matrix (DFM),
- Dictionary-based extraction of sentiments, topics, and geographical locations.
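These steps can be illustrated with a minimal sketch, assuming scikit-learn for the document-feature matrix; the miniature dictionaries below are stand-ins for the full AFINN lexicon and the complete topic and location lists, and the document texts are hypothetical:

```python
# Minimal sketch of the dictionary-based extraction steps: build a
# document-feature matrix (DFM), then score sentiment, topics, and
# locations against small hypothetical dictionaries.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Port strike in California delays cargo, a bad disruption for retailers",
    "New warehouse investment boosts logistics capacity, a good sign",
]

# Hypothetical miniature dictionaries (stand-ins for the full lexicons).
afinn = {"bad": -3, "good": 3, "delays": -1, "boosts": 2}
topics = {
    "logistics": ["cargo", "warehouse", "logistics"],
    "political": ["strike"],
}
locations = {"states": ["california"]}

# Step 1: convert the corpus into a DFM (documents x terms, raw counts).
vectorizer = CountVectorizer()
dfm = pd.DataFrame(
    vectorizer.fit_transform(docs).toarray(),
    columns=vectorizer.get_feature_names_out(),
)

# Step 2: dictionary-based sentiment: sum AFINN weights of matched terms,
# then standardize with a Z-score across the corpus, as described above.
sent_terms = [t for t in afinn if t in dfm.columns]
sentiment = dfm[sent_terms].mul(pd.Series(afinn)[sent_terms]).sum(axis=1)
sentiment_z = (sentiment - sentiment.mean()) / sentiment.std()

# Step 3: topic share: relative frequency of each topic's tokens per document.
topic_counts = pd.DataFrame({
    name: dfm[[t for t in terms if t in dfm.columns]].sum(axis=1)
    for name, terms in topics.items()
})
topic_share = topic_counts.div(topic_counts.sum(axis=1), axis=0)

# Step 4: location mentions, counted the same dictionary-based way.
state_terms = [t for t in locations["states"] if t in dfm.columns]
state_mentions = dfm[state_terms].sum(axis=1)

print(sentiment_z, topic_share, state_mentions, sep="\n")
```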