Key generic technology prediction in patent citation using graph neural networks

Ding, M. L.

Published Jan 11, 2024 on Dryad. https://doi.org/10.5061/dryad.nk98sf803

Data files

Jan 11, 2024 version files 11.16 MB

A)Table_of_key_generic_indicators_for_nodes_(partial_1).csv (2.31 MB)
B)Table_of_key_generic_indicators_for_nodes_(partial_2).csv (4.08 MB)
C)patent.content (3.63 MB)
D)patent.cites (1.13 MB)
E)Graph_neural_network_modeling_highest_accuracy_for_different_dimensions.csv (99 B)
F)Prediction_effects_of_key_generic_technologies.csv (243 B)
README.md (4.34 KB)

Abstract

With the rapid advancement of the Fourth Industrial Revolution, international competition in technology and industry is intensifying. However, in the era of big data and large-scale science, accurately identifying key technology areas and innovation trends has become exceptionally difficult. This paper constructs a patent indicator evaluation system based on the key and generic dimensions of patent citation, integrates graph neural network modeling to predict key generic technologies, and confirms the effectiveness of the method using the field of genetic engineering as an example. According to the LDA topic model, the main technical R&D directions in genetic engineering are genetic analysis and detection technologies, the application of microorganisms in industrial production, virology research involving vaccine development and immune responses, high-throughput sequencing and analysis technologies in genomics, targeted drug design and molecular therapeutic strategies, and genetically modified crop improvement. The accuracy of the graph neural network models in predicting key generic technologies reaches 97.67%. Drawing on patent citation theory and graph neural network models, this paper considers the structural and technical attributes of cited patents, provides theoretical and empirical evidence for technology prediction, and offers both theoretical and practical value.

This README file was generated on 2023-11-25 by Mingli Ding.

GENERAL INFORMATION

Author Information
Investigators Contact Information
Name: Mingli Ding; Wangke Yu; Shuhua Wang
Institution: Jingdezhen Ceramic University
Address: Jingdezhen, Jiangxi, China
Email: mlding1@163.com
Date of data collection: 2013-2022

DATA & FILE OVERVIEW

File List:

A) Table of key generic indicators for nodes (partial 1).csv

B) Table of key generic indicators for nodes (partial 2).csv

C) patent.content

D) patent.cites

E) Graph neural network modeling highest accuracy for different dimensions.csv

F) Prediction effects of key generic technologies.csv

DATA-SPECIFIC INFORMATION FOR: Table of key generic indicators for nodes (partial 1).csv

Number of variables: 10
Number of cases/rows: 72489
Variable List:

technical coverage: number of national economic classifications
patent families: number of patent families
patent family citation: patent family average annual citation frequency
patent cooperation: whether there are more than two applicants who have jointly applied for a patent.
enterprise-enterprise cooperation: whether more than two enterprises have jointly applied for a patent.
industry-university-research cooperation: whether any enterprises have applied for a patent jointly with universities or research institutions.
claims: number of claims
citation frequency: average annual citation frequency
layout countries: number of layout countries
patent age: age of the patent

DATA-SPECIFIC INFORMATION FOR: Table of key generic indicators for nodes (partial 2).csv

Number of variables: 10
Number of cases/rows: 72489
Variable List:

technical convergence: number of secondary (deputy) IPC (International Patent Classification) codes
cited countries: number of cited countries
inventors: number of inventors
citations: number of forward citations
homologous countries/areas: number of homologous countries/areas
degree centrality: the degree to which a node is directly connected to the other nodes in the network; the simplest indicator for quantifying a node's influence.
closeness centrality: how close a node is to all other nodes in the network, indicating whether the node sits at the core of the technological network.
betweenness centrality: how often a node lies on the shortest paths between other nodes; a high value characterizes a node with strong control over resources in the network.
eigenvector centrality: a measure of the importance of a node in a graph, based on the idea that the centrality of a node is a function of the centrality of its neighboring nodes.
PageRank: an algorithm originally used to rank the importance of web pages; it assigns each node a positive real number indicating its importance, and these values together form a vector.
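As an illustration of two of the network indicators defined above, the following is a minimal sketch in plain Python, not the procedure used to produce the dataset. The toy edge list, the patent IDs, the damping factor of 0.85, and the iteration count are all assumptions made for the example.

```python
from collections import defaultdict

# Toy citation edges as (citing, cited); the IDs are made up for illustration.
edges = [("P2", "P1"), ("P3", "P1"), ("P3", "P2"), ("P4", "P3")]

nodes = sorted({n for edge in edges for n in edge})
out_links = defaultdict(list)   # citing patent -> patents it cites
neighbors = defaultdict(set)    # undirected neighbourhood of each patent
for citing, cited in edges:
    out_links[citing].append(cited)
    neighbors[citing].add(cited)
    neighbors[cited].add(citing)

# Degree centrality: number of distinct neighbours divided by (n - 1).
n = len(nodes)
degree_centrality = {v: len(neighbors[v]) / (n - 1) for v in nodes}


def pagerank(nodes, out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank flows from citing to cited patents."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by patents that cite nothing is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


pr = pagerank(nodes, out_links)
for v in nodes:
    print(v, round(degree_centrality[v], 3), round(pr[v], 3))
```

In this toy graph the most frequently cited patent receives the highest PageRank, while the patent with the most distinct neighbours has the highest degree centrality; the two indicators need not agree.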

DATA-SPECIFIC INFORMATION FOR: patent.content

Number of variables: 22
Number of cases/rows: 72489
Variable List:

ID: ID number of patents
variables 2-21: the same as the variables in file A) and file B)
label: CORE (the patent is key generic technology) or NON (the patent is not key generic technology).
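A record with this layout (one ID, 20 indicator values, one label) could be parsed along the following lines. This is a sketch under assumptions: the fields are taken to be whitespace-separated, the ID is assumed to contain no spaces, and the sample ID is invented; the function name `parse_content_line` is hypothetical.

```python
def parse_content_line(line):
    """Split one patent.content line into (patent_id, indicators, label)."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) != 22:
        raise ValueError(f"expected 22 fields, got {len(fields)}")
    patent_id, label = fields[0], fields[-1]
    if label not in {"CORE", "NON"}:
        raise ValueError(f"unexpected label: {label!r}")
    indicators = [float(x) for x in fields[1:-1]]
    return patent_id, indicators, label


# Made-up sample line: an ID, 20 indicator values, and a label.
sample = "CN0000001 " + " ".join(["1", "0"] * 10) + " CORE"
patent_id, indicators, label = parse_content_line(sample)
print(patent_id, len(indicators), label)
```

Validating the field count and the label up front makes a wrong delimiter assumption fail loudly rather than silently misaligning the indicator columns.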

DATA-SPECIFIC INFORMATION FOR: patent.cites

Number of variables: 2
Number of cases/rows: 72489
Variable List:

source: the ID number of the cited patent
target: the ID number of the citing patent
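To make the direction of each record concrete, the sketch below reads (cited, citing) pairs and counts forward citations per cited patent. It assumes whitespace-separated fields and uses made-up IDs; `parse_cites` is a hypothetical helper, not part of the dataset.

```python
from collections import defaultdict


def parse_cites(lines):
    """Map each cited patent to the list of patents that cite it."""
    cited_by = defaultdict(list)
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        cited, citing = fields  # first column: cited; second column: citing
        cited_by[cited].append(citing)
    return cited_by


# Made-up sample records in "cited citing" order.
sample_lines = ["P1 P2", "P1 P3", "P2 P3"]
cited_by = parse_cites(sample_lines)
forward_citations = {patent: len(citers) for patent, citers in cited_by.items()}
print(forward_citations)
```

With a real file, `parse_cites(open("patent.cites"))` would work the same way, provided the delimiter assumption holds.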

DATA-SPECIFIC INFORMATION FOR: Graph neural network modeling highest accuracy for different dimensions.csv

Number of variables: 4
Number of cases/rows: 3
Variable List:

dimensions of graph network: 4
dimensions of graph network: 8
dimensions of graph network: 12
dimensions of graph network: 16

DATA-SPECIFIC INFORMATION FOR: Prediction effects of key generic technologies.csv

Number of variables: 10
Number of cases/rows: 3
Variable List:

epochs: 100
epochs: 200
epochs: 300
epochs: 400
epochs: 500
epochs: 600
epochs: 700
epochs: 800
epochs: 900
epochs: 1000

These datasets were obtained from the Incopat patent database and cover cited patents (2013-2022) in the field of genetic engineering.

Details for the datasets are provided in the README file.

This directory contains the selection of the patent datasets.

1) Table of key generic indicators for nodes (partial 1).csv

This file consists of 10 patent indicators: technical coverage, patent families, patent family citation, patent cooperation, enterprise-enterprise cooperation, industry-university-research cooperation, claims, citation frequency, layout countries, and patent age.

2) Table of key generic indicators for nodes (partial 2).csv

This file consists of 10 indicators of patents: technical convergence, cited countries, inventors, citations, homologous countries/areas, degree centrality, closeness centrality, betweenness centrality, eigenvector centrality, and PageRank.

3) patent.content

The content file contains descriptions of the patents in the following format: <ID_number> <technical_attributes> + <class_label>. The first entry in each line contains the unique string ID number of the patent, followed by binary values indicating whether the patent's value for each indicator exceeds the average of that indicator (1) or not (0). The last entry in the line contains the class label of the patent.

4) patent.cites

Each line contains two patent ID numbers. The first entry is the ID number of the patent being cited and the second is the ID number of the patent that contains the citation. The direction of the link is from right to left: a line "patent1 patent2" represents the link "patent2->patent1".
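The binary encoding of indicators in patent.content (1 when a patent's value exceeds the indicator average, 0 otherwise) could be reproduced from raw indicator values roughly as follows. This is a sketch only: the function name `binarize_by_mean`, the toy values, and the use of a strict "greater than" comparison against the arithmetic mean are assumptions.

```python
def binarize_by_mean(rows):
    """Flag each value with 1 if it exceeds its column mean, else 0."""
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / len(rows) for j in range(n_cols)]
    return [[1 if row[j] > means[j] else 0 for j in range(n_cols)] for row in rows]


# Toy data: three patents, two indicators (values are made up).
raw = [
    [3.0, 10.0],
    [1.0, 20.0],
    [2.0, 60.0],
]
flags = binarize_by_mean(raw)
print(flags)
```

Here the column means are 2.0 and 30.0, so only the first patent is flagged on the first indicator and only the third on the second.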

5) Graph neural network modeling highest accuracy for different dimensions.csv

This file shows the best accuracies of the GCN, SAGE, and GAT models for different graph network dimensions (4, 8, 12, and 16).

6) Prediction effects of key generic technologies.csv

This file shows the accuracies of the GCN, SAGE, and GAT models at different numbers of epochs (100 to 1000).