Data cleaning and enrichment through data integration: networking the Italian academia

Finocchi, Irene; Martino, Alessio; Sinaimeri, Blerina; Ranjbar, Fariba

Published Jan 29, 2025 on Dryad. https://doi.org/10.5061/dryad.wpzgmsbwj

Data files

Jan 29, 2025 version files 140 MB

Coauthorship_Network_AGG.graphml (101.25 MB)
codes.zip (4.07 MB)
edge_data_AGG.csv (34.67 MB)
README.md (6.61 KB)
supplementary_data.zip (18.72 KB)

Abstract

We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar.

Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. Because it exhibits several features of social networks, the dataset can be used in experimental studies across the wider social network analytics domain as well as in more specific bibliometric contexts.

README: Data cleaning and enrichment through data integration: networking the Italian academia

https://doi.org/10.5061/dryad.wpzgmsbwj

Description of the data and file structure

This repository contains two main data files:

edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
Coauthorship_Network_AGG.graphml, the full network in GraphML format.

along with the supplementary files listed below, which are needed only to rebuild the network (i.e., for reproducibility):

University-City-match.xlsx, an Excel file that maps the name of each university to the city where its headquarters is located;
Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.

Description of the main data files

Coauthorship_Network_AGG.graphml is the core file and contains the network data in GraphML format. Loading it yields an undirected graph in which nodes are authors and an edge connects two authors if they have co-authored at least one paper.

Each node (i.e., author) is equipped with the following author-related information:

Affiliation: (string) the name of the university the author is affiliated with;
City: (string) the city where the headquarters of the Affiliation is located (see the supplementary file University-City-match.xlsx for details);
Gender: (string) either M or F;
Position: (string) the position held by the author. For the sake of simplicity, we have mapped the Italian list of roles onto three universally recognised positions: Full Professor, Associate Professor, Researcher;
ResearchArea: (string) the research area of the author, according to the Italian list of academic fields (the full list can be found here);
PaperCount: (int) the number of papers co-authored by the author;
CitationCount: (int) the total number of citations across all papers co-authored by the author;
hindex: (int) the Hirsch index (a.k.a. h-index) of the author;
SS_AREA: (string) a string encoding a JSON-like object that, for each paper category (as provided by Semantic Scholar), counts the number of papers belonging to that category. Please note that Semantic Scholar may assign multiple categories to the same paper, so the sum of these counts may differ from PaperCount.

Below is an example of the attributes for a node representing an author, identified by the ID 17360:

{'Affiliation': 'TORINO',
'City': 'Torino',
'Gender': 'M',
'Position': 'Associate Professor',
'ResearchArea': '01/B1',
'PaperCount': 135,
'CitationCount': 1593,
'hindex': 17,
'SS_AREA': '{"Computer Science": 109, "Philosophy": 2, "Engineering": 2, "Political Science": 4, "Economics": 1, "Linguistics": 3, "Law": 8, "Business": 3, "Environmental Science": 2, "Art": 1, "Geography": 1, "Medicine": 1, "Mathematics": 1, "Education": 1, "Psychology": 1, "Physics": 1}'}
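
Because SS_AREA is stored as a string, it has to be decoded before its counts can be used. The following is a minimal sketch that works on the attribute dictionary of node 17360 copied from the example above; it assumes the string is valid JSON, as it is in this example. The helper name decode_counts is illustrative and not part of the repository. If the graph is loaded with NetworkX's read_graphml, note that node IDs are read as strings by default.

```python
import json

# Attributes of node 17360, copied from the example above.
node_attrs = {
    "Affiliation": "TORINO",
    "City": "Torino",
    "Gender": "M",
    "Position": "Associate Professor",
    "ResearchArea": "01/B1",
    "PaperCount": 135,
    "CitationCount": 1593,
    "hindex": 17,
    "SS_AREA": '{"Computer Science": 109, "Philosophy": 2, "Engineering": 2, '
               '"Political Science": 4, "Economics": 1, "Linguistics": 3, '
               '"Law": 8, "Business": 3, "Environmental Science": 2, "Art": 1, '
               '"Geography": 1, "Medicine": 1, "Mathematics": 1, "Education": 1, '
               '"Psychology": 1, "Physics": 1}',
}


def decode_counts(encoded: str) -> dict:
    """Decode a JSON-encoded {category: paper count} string (hypothetical helper)."""
    return json.loads(encoded)


areas = decode_counts(node_attrs["SS_AREA"])

# Most frequent Semantic Scholar category for this author.
top_area = max(areas, key=areas.get)

# Papers can carry several categories, so this sum need not equal PaperCount.
label_total = sum(areas.values())

print(top_area, areas[top_area])
print(label_total, node_attrs["PaperCount"])
```

For this node the category counts sum to more than PaperCount, which is the multi-category effect described above.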

Each edge (i.e., collaboration between any two authors) is equipped with the following collaboration-related information:

paperCategories: (string) a string encoding a json-like object that, for each paper category (as provided by Semantic Scholar) counts the number of papers belonging to such category amongst the ones co-authored by the two authors;
citationCount: (int) the total number of citations received by all papers co-authored by the two authors;
paperCount: (int) the number of papers co-authored by the two authors.

Below is an example of the attributes for the edge between nodes with IDs 28881 and 310:

{'paperCount': 41,
'citationCount': 342,
'paperCategories': '{"Computer Science": 36, "Biology": 1, "Medicine": 2, "Mathematics": 6, "Engineering": 3, "Business": 1, "Materials Science": 1, "Uncategorized": 5}'}
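
To make the meaning of the three edge attributes concrete, the sketch below aggregates them from a list of papers. It is an illustration only and is not taken from Coauthorship Network.py: the toy paper records, the function name build_edges, and the choice to count papers without categories under "Uncategorized" are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records: (author IDs, citation count, Semantic Scholar categories).
papers = [
    ({"a", "b", "c"}, 10, ["Computer Science"]),
    ({"a", "b"}, 4, ["Computer Science", "Mathematics"]),
    ({"b", "c"}, 0, []),
]


def build_edges(papers):
    """Aggregate per-pair attributes over all papers shared by two authors."""
    edges = defaultdict(
        lambda: {"paperCount": 0, "citationCount": 0, "paperCategories": Counter()}
    )
    for authors, citations, categories in papers:
        # Sorting gives one canonical key per unordered pair (undirected graph).
        for u, v in combinations(sorted(authors), 2):
            edge = edges[(u, v)]
            edge["paperCount"] += 1
            edge["citationCount"] += citations
            # Assumption: papers with no category are counted as "Uncategorized".
            edge["paperCategories"].update(categories or ["Uncategorized"])
    return dict(edges)


edges = build_edges(papers)
for pair, attrs in sorted(edges.items()):
    print(pair, attrs["paperCount"], attrs["citationCount"], dict(attrs["paperCategories"]))
```

As with SS_AREA on nodes, a paper with several categories contributes to several counters, so the category counts on an edge can sum to more than its paperCount (as in the example edge above).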

This repository also features an additional network file, edge_data_AGG.csv, which contains the network in comma-separated edge list format. Each row (that is, each edge) features:

the Source node ID
the Target node ID
the aforementioned three edge attributes (paperCategories, citationCount and paperCount)
an additional YearRange variable (string), which contains the years of the first and last papers co-authored by the Source and Target authors. If this information is not available in Semantic Scholar, a default value of -1 is provided instead.
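
The following sketch reads rows of this kind with Python's standard csv module. Several things here are assumptions rather than facts about the file: the exact header names (Source, Target, paperCategories, citationCount, paperCount, YearRange), and the textual layout of YearRange, which is not specified above. For that reason the parser simply extracts any four-digit years it finds and treats the -1 placeholder as missing. The sample rows are built in memory; the first reuses the example edge above with a made-up YearRange, and the second is entirely made up.

```python
import csv
import io
import json
import re

# Assumed header names; check them against the actual file.
COLUMNS = ["Source", "Target", "paperCategories", "citationCount", "paperCount", "YearRange"]


def parse_year_range(raw: str):
    """Return (first_year, last_year), or None for the -1 placeholder."""
    years = [int(y) for y in re.findall(r"\d{4}", raw)]
    if not years:
        return None
    return min(years), max(years)


def read_edges(handle):
    """Yield one decoded dictionary per edge row."""
    for row in csv.DictReader(handle):
        yield {
            "source": row["Source"],
            "target": row["Target"],
            "paperCount": int(row["paperCount"]),
            "citationCount": int(row["citationCount"]),
            "paperCategories": json.loads(row["paperCategories"]),
            "years": parse_year_range(row["YearRange"]),
        }


# Build a small in-memory sample instead of opening the 34.67 MB file.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow({
    "Source": 28881, "Target": 310,
    "paperCategories": '{"Computer Science": 36, "Biology": 1, "Medicine": 2, '
                       '"Mathematics": 6, "Engineering": 3, "Business": 1, '
                       '"Materials Science": 1, "Uncategorized": 5}',
    "citationCount": 342, "paperCount": 41,
    "YearRange": "2005-2020",  # hypothetical value and format
})
writer.writerow({  # entirely hypothetical row with missing temporal data
    "Source": 1, "Target": 2,
    "paperCategories": '{"Uncategorized": 1}',
    "citationCount": 0, "paperCount": 1,
    "YearRange": "-1",
})
sample.seek(0)

edges = list(read_edges(sample))
for edge in edges:
    print(edge["source"], edge["target"], edge["paperCount"], edge["years"])
```

To run this against the real file, replace the in-memory sample with `open("edge_data_AGG.csv", newline="", encoding="utf-8")` after confirming the header names.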

Code/Software

The data in this repository is accompanied by several source code files. They are listed below in the order in which they should be run:

main_filtering.py: this is the starting point. The end-user must first download all 14 Excel files from Cineca (one per scientific sector) and the authors and papers files from Semantic Scholar. The script then performs a preliminary filtering of the very large Semantic Scholar datasets, retaining only candidate authors through a simple matching against Cineca;
Removing DuplicateIds in SemS.py: a helper script that, starting from the output of main_filtering.py, returns a structured Excel file containing all the information for each filtered author.

Next come six Jupyter notebooks:

Cineca_Name Occurrence 1.ipynb
Cineca_Name Occurrence 2 - FullNames.ipynb
Cineca_Name Occurrence 3 - FullNames.ipynb
Cineca_Name Occurrence 4 - FullNames.ipynb
Cineca_Name Occurrence 5 - FullNames.ipynb
Cineca_Name Occurrence 6 - FullNames.ipynb

These notebooks perform the full matching between authors in Semantic Scholar and Cineca for 1-occurrences through 6-occurrences, respectively.

Finally:

Coauthorship Network.py, which builds the network

and

networkAnalyzer.py, a Python script that serves as a usage example: it loads the GraphML file and computes some statistics on the network (number of nodes, number of edges, some centrality measures, and the degree distribution plot).
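
networkAnalyzer.py itself is not reproduced here. As a dependency-free illustration of the simplest statistics it is described as reporting, the sketch below streams a GraphML document with the standard library and derives node and edge counts, the degree distribution, and degree centrality. This is a swapped-in technique, not the script's method; a graph library such as NetworkX is the more usual route. The sketch assumes the file uses the standard GraphML XML namespace, and it runs on a tiny in-memory document rather than the real file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the file declares the standard GraphML namespace.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def degree_stats(source):
    """Stream a GraphML document; return (node count, edge count, degree per node)."""
    degree = Counter()
    n_nodes = n_edges = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == GRAPHML_NS + "node":
            n_nodes += 1
            degree.setdefault(elem.get("id"), 0)  # keep isolated nodes
            elem.clear()  # release attribute data already processed
        elif elem.tag == GRAPHML_NS + "edge":
            n_edges += 1
            degree[elem.get("source")] += 1
            degree[elem.get("target")] += 1
            elem.clear()
    return n_nodes, n_edges, degree


# Tiny hypothetical graph: a triangle plus one isolated node.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="undirected">
    <node id="0"/><node id="1"/><node id="2"/><node id="3"/>
    <edge source="0" target="1"/>
    <edge source="0" target="2"/>
    <edge source="1" target="2"/>
  </graph>
</graphml>"""

n_nodes, n_edges, degree = degree_stats(io.BytesIO(sample))

# Degree distribution: how many nodes have each degree.
distribution = Counter(degree.values())

# Degree centrality: degree divided by the number of other nodes.
centrality = {node: d / (n_nodes - 1) for node, d in degree.items()}

print(n_nodes, n_edges)
print(sorted(distribution.items()))
```

Passing the path to Coauthorship_Network_AGG.graphml instead of the in-memory sample should work the same way, since iterparse accepts a filename as well as a binary file object.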

Reuse and Citation Policy

The dataset is distributed under the Creative Commons CC0 license. However, if you use this dataset in a scientific work, we invite you to cite the associated article published in Scientific Data with DOI 10.1038/s41597-025-04608-6.