Dryad

Data from: A stochastic generative model for citation networks among academic papers

Cite this dataset

Yasui, Yuichiro (2022). Data from: A stochastic generative model for citation networks among academic papers [Dataset]. Dryad. https://doi.org/10.5061/dryad.z8w9ghxfh

Abstract

We propose a stochastic generative model for the directed graph formed by citations among academic papers, where nodes represent papers with discrete publication times and directed edges represent citations. Like existing models, the proposed model assumes that a citation between two papers occurs with a probability based on the type of the citing paper, the importance of the cited paper, and the difference between their publication times. We take the out-degree of a citing paper as its type because, for example, a survey paper cites many papers. We approximate the importance of a cited paper by its in-degree. Our model adopts three functions: a logistic function for the number of papers published at each discrete time, an inverse Gaussian probability distribution function for the aging effect based on the difference between publication times, and an exponential distribution (or a generalized Pareto distribution) for the out-degree distribution. We consider our model more reasonable and appropriate than existing stochastic models, and it can perform complete simulations without using the original data. In this paper, we first use the Web of Science database to examine the features used in our model. Using the proposed model, we generate simulated graphs and demonstrate that they resemble the original data in their in- and out-degree distributions and node triangle participation. In addition, we analyze two other citation networks derived from physics papers in the arXiv database and verify the effectiveness of the model.
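As a rough illustration of the three building blocks named above, the following Python sketch writes out a logistic curve, the standard inverse Gaussian density, and an exponential out-degree sampler. It is not the authors' implementation: the function names and every parameter value are hypothetical, the abstract does not say how the pieces are combined into a citation probability, and flooring a continuous exponential draw to an integer out-degree is an assumption made here for simplicity.

```python
import math
import random


def logistic(t, capacity, rate, midpoint):
    """Logistic curve, e.g. for the number of papers published at discrete time t."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))


def inverse_gaussian_pdf(x, mu, lam):
    """Standard inverse Gaussian density with mean mu and shape lam (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    coeff = math.sqrt(lam / (2.0 * math.pi * x ** 3))
    return coeff * math.exp(-lam * (x - mu) ** 2 / (2.0 * mu ** 2 * x))


def sample_out_degree(mean, rng):
    """Draw an out-degree by flooring an exponential variate (an assumption of this sketch)."""
    return int(rng.expovariate(1.0 / mean))


# Hypothetical parameter values, chosen only to exercise the functions.
rng = random.Random(0)
papers_at_midpoint = logistic(2000, capacity=100.0, rate=0.5, midpoint=2000)
aging_weight = inverse_gaussian_pdf(3.0, mu=5.0, lam=10.0)  # 3-year publication gap
out_degrees = [sample_out_degree(6.0, rng) for _ in range(5)]
```

At the midpoint the logistic curve returns exactly half of `capacity`, which gives a quick sanity check on the parameterization.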

Methods

We focus on a subset of the Web of Science (WoS), WoS-Stat, which is a citation network that comprises the citations between papers published in journals whose subject is associated with “Statistics and Probability.” We construct a citation network utilizing a paper identifier (ID), publication year, and reference list (list of paper IDs) for 36 years, from 1981 to 2016. WoS-Stat consists of 179,483 papers and 1,106,622 citations.
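The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout (`id`, `year`, `references`), the function name `build_citation_edges`, and the toy records are all hypothetical. The sketch also assumes that references to papers outside the selected set are dropped and that edges run from the citing paper to the cited paper.

```python
def build_citation_edges(records, first_year=1981, last_year=2016):
    """Return (citing_id, cited_id) pairs among records published in the year range."""
    # Keep only papers inside the publication window.
    kept = [r for r in records if first_year <= r["year"] <= last_year]
    known = {r["id"] for r in kept}
    # Keep a reference only if the cited paper is itself in the subset.
    return [(r["id"], ref) for r in kept for ref in r["references"] if ref in known]


# Toy records with made-up IDs; "X" is outside the subset and is dropped.
records = [
    {"id": "A", "year": 1981, "references": []},
    {"id": "B", "year": 1990, "references": ["A", "X"]},
    {"id": "C", "year": 2016, "references": ["A", "B"]},
]
edges = build_citation_edges(records)
```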

Usage notes

WoS-Stat consists of two CSV files, wos-stat_edges.csv and wos-stat_nodes.csv, which can be used with a variety of software.

The following example uses Python and NetworkX to construct a directed graph G from the CSV files. Each node of G has a publication_year attribute giving the year of publication and a uid attribute giving the paper's ID in the Web of Science.

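import pandas as pd
import networkx as nx

# Load the edge list (columns 'from' and 'to') and the node table indexed by node_id.
edges = pd.read_csv('wos-stat_edges.csv')
nodes = pd.read_csv('wos-stat_nodes.csv', index_col='node_id')

# Build the directed citation graph, then attach each row of the node table as node attributes.
G = nx.from_pandas_edgelist(edges, source='from', target='to', create_using=nx.DiGraph)
nx.set_node_attributes(G, nodes.to_dict('index'))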
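
Once a graph like G is built, a few quick checks can confirm that it matches the description in Methods. The sketch below is not part of the dataset's documentation: the helper names `summarize`, `triangle_participation`, and `count_backward_citations` are hypothetical, and a three-node toy graph stands in for the real files. It assumes that edges point from the citing paper to the cited paper, which the column names `from` and `to` suggest but the source does not state, and it counts triangles on the undirected version of the graph, which is one possible reading of "node triangle participation".

```python
from collections import Counter

import networkx as nx


def summarize(G):
    """Return node and edge counts plus in-/out-degree histograms of a DiGraph."""
    return {
        "papers": G.number_of_nodes(),
        "citations": G.number_of_edges(),
        "in_degree_hist": Counter(d for _, d in G.in_degree()),
        "out_degree_hist": Counter(d for _, d in G.out_degree()),
    }


def triangle_participation(G):
    """Number of triangles each node belongs to, ignoring edge direction."""
    return nx.triangles(G.to_undirected())


def count_backward_citations(G, year_attr="publication_year"):
    """Count edges whose source is older than its target (assumes citing -> cited)."""
    return sum(
        1 for u, v in G.edges if G.nodes[u][year_attr] < G.nodes[v][year_attr]
    )


# Toy stand-in for the real graph: paper 3 cites 1 and 2, paper 2 cites 1.
toy = nx.DiGraph()
toy.add_nodes_from([
    (1, {"publication_year": 1981}),
    (2, {"publication_year": 1990}),
    (3, {"publication_year": 2000}),
])
toy.add_edges_from([(2, 1), (3, 1), (3, 2)])

stats = summarize(toy)
triangles = triangle_participation(toy)
backward = count_backward_citations(toy)
```

Applied to the real G, `summarize(G)` should report 179,483 papers and 1,106,622 citations if the files match the counts given in Methods. A large value from `count_backward_citations(G)` would suggest that the edge direction is the reverse of what this sketch assumes.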