Data cleaning and enrichment through data integration: networking the Italian academia
Data files
Jan 29, 2025 version files 140 MB
- Coauthorship_Network_AGG.graphml (101.25 MB)
- codes.zip (4.07 MB)
- edge_data_AGG.csv (34.67 MB)
- README.md (6.61 KB)
- supplementary_data.zip (18.72 KB)
Abstract
README: Data cleaning and enrichment through data integration: networking the Italian academia
https://doi.org/10.5061/dryad.wpzgmsbwj
Manuscript published in Scientific Data with DOI
Description of the data and file structure
This repository contains two main data files:
- edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
- Coauthorship_Network_AGG.graphml, the full network in GraphML format.
It also contains several supplementary data files, listed below, which are needed only to build the network (i.e., for reproducibility):
- University-City-match.xlsx, an Excel file that maps each university name to the city where its headquarters is located;
- Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca to the research areas in Semantic Scholar.
Description of the main data files
Coauthorship_Network_AGG.graphml is the core file and contains the network data in GraphML format. Loading it yields an undirected graph in which nodes represent authors and an edge connects two authors if they have co-authored at least one paper.
Each node (i.e., author) is equipped with the following author-related information:
- Affiliation: (string) the name of the university the author is affiliated with;
- City: (string) the city where the headquarters of the Affiliation is located (cf. the supplementary file University-City-match.xlsx for details);
- Gender: (string) either M or F;
- Position: (string) the position held by the author. For simplicity, we have mapped the Italian list of roles onto three widely recognised positions: Full Professor, Associate Professor, Researcher;
- ResearchArea: (string) the research area of the author, according to the Italian list of academic fields (the full list can be found here);
- PaperCount: (int) the number of papers co-authored by the author;
- CitationCount: (int) the total number of citations across all papers co-authored by the author;
- hindex: (int) the Hirsch index (a.k.a. h-index) of the author;
- SS_AREA: (string) a string encoding a JSON-like object that, for each paper category (as provided by Semantic Scholar), counts the number of papers belonging to that category. Note that Semantic Scholar may assign multiple categories to the same paper, so the sum of these counts may differ from PaperCount.
Below is an example of the attributes for a node representing an author, identified by the ID 17360:
{'Affiliation': 'TORINO',
 'City': 'Torino',
 'Gender': 'M',
 'Position': 'Associate Professor',
 'ResearchArea': '01/B1',
 'PaperCount': 135,
 'CitationCount': 1593,
 'hindex': 17,
 'SS_AREA': '{"Computer Science": 109, "Philosophy": 2, "Engineering": 2, "Political Science": 4, "Economics": 1, "Linguistics": 3, "Law": 8, "Business": 3, "Environmental Science": 2, "Art": 1, "Geography": 1, "Medicine": 1, "Mathematics": 1, "Education": 1, "Psychology": 1, "Physics": 1}'}
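Because SS_AREA is stored as a string, it has to be decoded before the per-category counts can be used. The following is a minimal sketch, assuming the string is valid JSON (as it is in the node example above); the helper name decode_categories is hypothetical and not part of the repository code.

```python
import json

# Values copied from the example node 17360 shown above.
paper_count = 135
ss_area = (
    '{"Computer Science": 109, "Philosophy": 2, "Engineering": 2, '
    '"Political Science": 4, "Economics": 1, "Linguistics": 3, "Law": 8, '
    '"Business": 3, "Environmental Science": 2, "Art": 1, "Geography": 1, '
    '"Medicine": 1, "Mathematics": 1, "Education": 1, "Psychology": 1, '
    '"Physics": 1}'
)


def decode_categories(raw):
    """Decode a JSON-encoded {category: count} string (hypothetical helper)."""
    return json.loads(raw) if raw else {}


categories = decode_categories(ss_area)

# Most frequent Semantic Scholar category for this author.
top_category = max(categories, key=categories.get)

# Papers can carry several categories, so this sum may exceed PaperCount.
category_total = sum(categories.values())

print(top_category, category_total, paper_count)
```

For this node the category counts add up to more than PaperCount, which illustrates the multiple-category caveat mentioned for SS_AREA.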
Each edge (i.e., collaboration between any two authors) is equipped with the following collaboration-related information:
- paperCategories: (string) a string encoding a JSON-like object that, for each paper category (as provided by Semantic Scholar), counts the number of papers belonging to that category among those co-authored by the two authors;
- citationCount: (int) the total number of citations received by all papers co-authored by the two authors;
- paperCount: (int) the number of papers that the two authors have co-authored.
Below is an example of the attributes for the edge between nodes with IDs 28881 and 310:
{'paperCount': 41,
 'citationCount': 342,
 'paperCategories': '{"Computer Science": 36, "Biology": 1, "Medicine": 2, "Mathematics": 6, "Engineering": 3, "Business": 1, "Materials Science": 1, "Uncategorized": 5}'}
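Graph libraries such as networkx can usually read GraphML directly. Where no such library is available, the node and edge attributes described above can also be extracted with the Python standard library. The sketch below is only an illustration under two assumptions: the file uses the standard GraphML namespace, and attribute names are declared through key elements. The function read_graphml and the toy document are hypothetical, and all values are returned as raw strings without type conversion.

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML namespace (assumed to be the one used by the file).
NS = "{http://graphml.graphdrawing.org/xmlns}"


def read_graphml(source):
    """Return (nodes, edges); attribute values are kept as raw strings."""
    root = ET.parse(source).getroot()
    # <key id="d0" attr.name="Gender" .../> maps short ids to attribute names.
    names = {k.get("id"): k.get("attr.name") for k in root.iter(NS + "key")}

    def attrs(element):
        return {names[d.get("key")]: d.text for d in element.findall(NS + "data")}

    nodes = {n.get("id"): attrs(n) for n in root.iter(NS + "node")}
    edges = [
        (e.get("source"), e.get("target"), attrs(e))
        for e in root.iter(NS + "edge")
    ]
    return nodes, edges


# Toy document with made-up IDs and values, standing in for the real file.
sample = b"""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="Gender" attr.type="string"/>
  <key id="d1" for="edge" attr.name="paperCount" attr.type="long"/>
  <graph edgedefault="undirected">
    <node id="a"><data key="d0">M</data></node>
    <node id="b"><data key="d0">F</data></node>
    <edge source="a" target="b"><data key="d1">3</data></edge>
  </graph>
</graphml>"""

nodes, edges = read_graphml(io.BytesIO(sample))
print(nodes)
print(edges)
```

To read the actual dataset, the same function could be called with the path to Coauthorship_Network_AGG.graphml instead of the in-memory sample; note that parsing the whole tree at once keeps the full file in memory.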
This repository also features an additional network file, namely edge_data_AGG.csv, which contains the network in a comma-separated edge list format. Each row (that is, each edge) features:
- the Source node ID
- the Target node ID
- the aforementioned three edge attributes (paperCategories, citationCount and paperCount)
- an additional YearRange variable (string), which contains the years of the first and last paper(s) co-authored by the Source and Target authors. If this information does not exist in Semantic Scholar, a default value of -1 is provided instead.
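The edge list can be read row by row with Python's csv module. The sketch below assumes that the file has a header row with the column names listed above, that the two count columns hold plain integers, and that a missing YearRange appears as the literal value -1. The function read_edges is hypothetical. Since the exact encoding of the two years inside YearRange is not specified here, the string is passed through unchanged.

```python
import csv
import io
import json


def read_edges(handle):
    """Yield one dict per edge with typed values (hypothetical helper)."""
    for row in csv.DictReader(handle):
        year_range = row["YearRange"].strip()
        yield {
            "source": row["Source"],
            "target": row["Target"],
            "paperCount": int(row["paperCount"]),
            "citationCount": int(row["citationCount"]),
            "paperCategories": json.loads(row["paperCategories"]),
            # -1 marks edges whose years are missing in Semantic Scholar.
            "yearRange": None if year_range == "-1" else year_range,
        }


# Toy rows with made-up values; the YearRange string is only a placeholder.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(
    ["Source", "Target", "paperCategories", "citationCount", "paperCount", "YearRange"]
)
writer.writerow(["a", "b", '{"Computer Science": 2}', 10, 2, "2015-2021"])
writer.writerow(["a", "c", '{"Medicine": 1}', 0, 1, -1])
buffer.seek(0)

edges = list(read_edges(buffer))
for edge in edges:
    print(edge)
```

For the real file, an open handle such as open("edge_data_AGG.csv", newline="") would take the place of the in-memory buffer.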
Code/Software
The data contained in this repository is accompanied by several source code files. They are listed below in the order in which they should be executed:
- main_filtering.py: this is the starting point. The end-user must have previously downloaded all 14 Excel files from Cineca (one per scientific sector) and the authors and papers files from Semantic Scholar. This script performs a preliminary filtering of the Semantic Scholar datasets, which are huge, retaining only candidate authors by means of a simple matching against Cineca;
- Removing DuplicateIds in SemS.py: this is a helper script that, starting from the output of main_filtering.py, returns a structured Excel file containing all the information for each filtered author.
Then, there are 6 Jupyter notebook files:
- Cineca_Name Occurrence 1.ipynb
- Cineca_Name Occurrence 2 - FullNames.ipynb
- Cineca_Name Occurrence 3 - FullNames.ipynb
- Cineca_Name Occurrence 4 - FullNames.ipynb
- Cineca_Name Occurrence 5 - FullNames.ipynb
- Cineca_Name Occurrence 6 - FullNames.ipynb
These notebooks perform the full matching between authors in Semantic Scholar and Cineca for 1-occurrences to 6-occurrences, respectively.
Finally:
- Coauthorship Network.py, which builds the network;
- networkAnalyzer.py, a Python script that serves as an example usage template: it loads the GraphML file and computes some statistics on the network (number of nodes and edges, some centrality measures, the degree distribution plot).
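To give an idea of the kind of statistics mentioned for networkAnalyzer.py, the sketch below computes node and edge counts, the degree distribution and degree centrality from a plain list of undirected edges. It is not the code shipped in codes.zip: the function name degrees and the toy edges are hypothetical, and it assumes each pair of authors appears only once in the list.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node (hypothetical helper)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


# Toy undirected edge list with made-up node IDs.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

degree = degrees(toy_edges)
node_count = len(degree)  # nodes without any edge are not seen here
edge_count = len(toy_edges)

# Degree distribution: how many nodes have each degree value.
distribution = Counter(degree.values())

# Degree centrality: degree divided by the maximum possible degree (n - 1).
centrality = {node: d / (node_count - 1) for node, d in degree.items()}

print(node_count, edge_count)
print(sorted(distribution.items()))
print(centrality)
```

The distribution dictionary can be passed to any plotting library to draw the degree distribution plot.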
Reuse and Citation Policy
The dataset is distributed under the Creative Commons CC0 license. However, if you use this dataset in a scientific work, we invite you to cite the associated article.
Methods
The proposed network is built starting from two distinct data sources:
- the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets)
- the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).
By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been matched against the names contained in the Cineca dataset, and authors with no match (e.g., because they are not part of an Italian university) have been discarded. The remaining authors form the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes.
In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one paper co-authored by said authors. Then, the edges have been enriched with edge-related (i.e., collaboration-related) attributes.
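The edge construction rule can be sketched as follows. This is an illustration of the rule stated above, not the code in Coauthorship Network.py. The record layout (the keys authors, citations and categories), the function build_edges and the toy papers are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_edges(papers, matched_authors):
    """Aggregate co-authorship edges between matched authors."""
    edges = {}
    for paper in papers:
        # Keep only authors that survived the matching against Cineca.
        authors = sorted(set(paper["authors"]) & matched_authors)
        for pair in combinations(authors, 2):
            edge = edges.setdefault(
                pair,
                {"paperCount": 0, "citationCount": 0, "paperCategories": Counter()},
            )
            edge["paperCount"] += 1
            edge["citationCount"] += paper["citations"]
            edge["paperCategories"].update(paper["categories"])
    return edges


# Toy input with made-up IDs; "x" stands for an author with no Cineca match.
papers = [
    {"authors": ["a", "b", "x"], "citations": 5, "categories": ["Computer Science"]},
    {"authors": ["a", "b", "c"], "citations": 2, "categories": ["Medicine", "Computer Science"]},
]
edges = build_edges(papers, matched_authors={"a", "b", "c"})

for pair, attrs in sorted(edges.items()):
    print(pair, attrs["paperCount"], attrs["citationCount"], dict(attrs["paperCategories"]))
```

Sorting the author IDs before pairing gives every undirected edge a single canonical key, so the same pair is never counted under two different orderings.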