This notebook processes the 2.8 million profiles in the ORCID public data file (16 December 2016) into a data set of the 741,867 profiles that have at least one listed education or employment affiliation. It outputs two files:

ORCID_migrations_2016_12_16.csv (222 MB)
ORCID_migrations_2016_12_16_by_person.csv (31 MB)

ORCID provided a compressed tarball containing the data in both XML format and JSON format. I needed about 300 GB of free space to decompress the data and work with it. If you don't have that much free space on your computer, an external SSD works well.
The following code takes the JSON version of the decompressed ORCID data as input. It is written in Python 3.4.5.
John Bohannon, Science Magazine
April 2017
import json, os, sys
import pandas as pd

def file_generator(json_dir):
    ''' Using a generator allows pausing and restarting
        without having to figure out where you left off. '''
    for n, i in enumerate(os.listdir(json_dir)):
        yield (n, os.path.join(json_dir, i))
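To see why the generator is handy, here is a quick standalone illustration (on a hypothetical temp directory, not the ORCID data): iteration can be interrupted and then resumed without re-listing or re-processing files.

```python
import os, tempfile

def file_generator(json_dir):
    ''' Yield (index, path) pairs so iteration can pause and resume. '''
    for n, i in enumerate(os.listdir(json_dir)):
        yield (n, os.path.join(json_dir, i))

# Toy directory with three files
tmp = tempfile.mkdtemp()
for name in ("a.json", "b.json", "c.json"):
    open(os.path.join(tmp, name), "w").close()

gen = file_generator(tmp)
first = next(gen)   # consume one file...
rest = list(gen)    # ...then resume; 'first' is not yielded again
```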
def get_profiles(data, json_files, stop = None):
    ''' Iterate over JSON files and process them '''
    for n, filepath in json_files:
        # keep track of progress
        sys.stdout.flush()
        sys.stdout.write('\r{}'.format(n))
        # terminate if stop is specified and reached
        if stop is not None and n >= stop:
            return
        # process this JSON file and harvest the data
        if filepath.endswith(".json"):
            with open(filepath) as f:
                profile = json.load(f)
            for row in get_affiliations(profile):
                data.append(row)
def has_affiliation(profile):
    ''' This tests whether the profile has any affiliations '''
    try:
        if profile["orcid-profile"]["orcid-activities"]["affiliations"]["affiliation"]:
            return True
    except (KeyError, TypeError):
        pass
    return False
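As a sanity check, here is the nested key path that test relies on, exercised on two toy profiles (hypothetical structures mirroring the ORCID JSON; the function is restated in the cell so it runs standalone):

```python
def has_affiliation(profile):
    ''' True if the profile lists at least one affiliation. '''
    try:
        return bool(profile["orcid-profile"]["orcid-activities"]
                           ["affiliations"]["affiliation"])
    except (KeyError, TypeError):
        return False

# A profile with one affiliation, and one whose activities are null
with_aff = {"orcid-profile": {"orcid-activities":
            {"affiliations": {"affiliation": [{"type": "EDUCATION"}]}}}}
no_activities = {"orcid-profile": {"orcid-activities": None}}
```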
def get_affiliations(profile):
    ''' For each profile, extract all affiliations and metadata '''
    profile_data = []
    if has_affiliation(profile):
        orcid_id = profile["orcid-profile"]["orcid-identifier"]["path"]
        for aff in profile["orcid-profile"]["orcid-activities"]["affiliations"]["affiliation"]:
            row = [orcid_id]
            row.append(aff["organization"]["address"]["country"])
            try:
                row.append(aff["organization"]["name"])
            except (KeyError, TypeError):
                row.append(None)
            try:
                row.append(aff["organization"]["disambiguated-organization"]["disambiguated-organization-identifier"])
            except (KeyError, TypeError):
                row.append(None)
            try:
                row.append(aff["start-date"]["year"]["value"])
            except (KeyError, TypeError):
                row.append(None)
            try:
                row.append(aff["end-date"]["year"]["value"])
            except (KeyError, TypeError):
                row.append(None)
            try:
                row.append(aff["type"])
            except (KeyError, TypeError):
                row.append(None)
            try:
                row.append(aff["role-title"])
            except (KeyError, TypeError):
                row.append(None)
            profile_data.append(row)
    return profile_data
json_dir = "/Volumes/playground/ORCID/public_profiles/json"
json_files = file_generator(json_dir)
data = []
%%time
get_profiles(data, json_files)
This took nearly 2 hours. Be patient.
There are 2.8 million ORCID profiles. Let's see how many of those profiles include affiliations data...
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name",
                                   "Ringgold_id", "start_year", "end_year",
                                   "affiliation_type", "affiliation_role"])
df.head()
df.orcid_id.nunique(), len(df)
There are 753,238 researchers with ORCID profiles that include affiliations data, with a total of about 2 million affiliations over the course of their careers.
education_without_dates = df[((df.affiliation_type == "EDUCATION") &
                              (df.start_year.isnull()) &
                              (df.end_year.isnull()))]
len(education_without_dates), education_without_dates.orcid_id.nunique()
NOTE: There are 96,143 profiles that include EDUCATION affiliations that do not have any start_date or end_date. These researchers are just listing their education affiliations without noting the year they obtained the degree.
These data stop in December 2016. So affiliations that do have a start_year but no end_year are ongoing affiliations. How many are there?
ongoing = df[((df.start_year.notnull()) & (df.end_year.isnull()))]
len(ongoing), ongoing.orcid_id.nunique()
There are about half a million people with ongoing affiliations as of 2016. Let's recode those missing end_year values of ongoing affiliations as 2016.
def recode_ongoing(row):
    ''' Recode the end_year value of some affiliations to 2016
        since they are ongoing affiliations. '''
    if pd.notnull(row.start_year) and pd.isnull(row.end_year):
        return 2016
    else:
        return row.end_year
df.end_year = df.apply(recode_ongoing, axis = 1)
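Row-wise apply is slow on ~2 million rows; the same recode can be done vectorized with a boolean mask and .loc. A minimal sketch on a toy dataframe (not the ORCID data):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"start_year": [2010, 2012, np.nan],
                    "end_year":   [2014, np.nan, np.nan]})

# Ongoing affiliation: has a start_year but no end_year -> recode end to 2016
ongoing = toy.start_year.notnull() & toy.end_year.isnull()
toy.loc[ongoing, "end_year"] = 2016
```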
Next we should exclude people whose country of residence in 2016 is ambiguous, i.e. those with more than one country affiliation. How many are there?
today_affs = df[df.end_year == 2016]
g = today_affs.groupby(["orcid_id", "country"]).aggregate(len).reset_index()
multi_2016_country_people = set(g[g.duplicated("orcid_id")].orcid_id)
len(multi_2016_country_people)
There are about 11,000 of these ambiguous people. Let's exclude them, since for the analysis of migrations, we don't know where these people ended up in the present day.
df = df[~df.orcid_id.isin(multi_2016_country_people)]
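The ambiguity test above boils down to a groupby/duplicated pattern: after collapsing to one row per (person, country) pair, anyone whose orcid_id appears more than once has affiliations in multiple countries. Here it is on a toy frame where person B spans two countries:

```python
import pandas as pd

toy = pd.DataFrame({"orcid_id": ["A", "A", "B", "B"],
                    "country":  ["US", "US", "US", "FR"],
                    "end_year": [2016, 2016, 2016, 2016]})

# One row per (person, country); a person left with two rows has two countries
g = toy.groupby(["orcid_id", "country"]).aggregate(len).reset_index()
ambiguous = set(g[g.duplicated("orcid_id")].orcid_id)
```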
Next, we want to know which of these people got PhD degrees. So let's classify the affiliation_role of the EDUCATION affiliations as being PhD degrees or not, just using simple string matching. I built this classifier through trial and error on these data. I have not yet found an exception, but there are probably some in there. The error rate should be very low.
def is_phd(role):
    ''' After lowercasing the affiliation_role string, look for
        terms that indicate it is a Ph.D. degree.
        Yes, regular expressions are tidier, but also unreadable
        to most people and harder to debug. '''
    # These are international synonyms for the Ph.D. degree
    synonyms = ("phd", "ph.d", "dphil", "d.phil", "rer. nat",
                "rer, nat", "doctor rerum", "doktor rerum")
    # This catches things like "Doctorate en Chimie" but
    # excludes "Postdoctoral Fellow" and "Medical Doctorate"
    special_cases_allowed = ("doctor", "doktor")
    special_cases_disallowed = ("pre", "post", "med")
    if type(role) == str:
        # lowercase the string
        role = role.lower()
        # Look for Ph.D. synonyms
        if any([(i in role) for i in synonyms]):
            return True
        # Look for special cases
        if any([(i in role) for i in special_cases_allowed]) and \
           not any([(i in role) for i in special_cases_disallowed]):
            return True
    # Otherwise call it False
    return False
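Before applying the classifier, a few spot checks illustrate the intended behavior (the function is restated here in condensed but equivalent form so the cell runs standalone):

```python
def is_phd(role):
    ''' Condensed copy of the string-matching classifier, for spot checks. '''
    synonyms = ("phd", "ph.d", "dphil", "d.phil", "rer. nat",
                "rer, nat", "doctor rerum", "doktor rerum")
    allowed, disallowed = ("doctor", "doktor"), ("pre", "post", "med")
    if isinstance(role, str):
        role = role.lower()
        if any(s in role for s in synonyms):
            return True
        if any(a in role for a in allowed) and \
           not any(d in role for d in disallowed):
            return True
    return False

examples = ["Ph.D. in Chemistry",   # synonym match
            "Doctorat en Chimie",   # special case allowed
            "Postdoctoral Fellow",  # blocked by "post"
            "Medical Doctorate"]    # blocked by "med"
```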
df["is_phd"] = df.affiliation_role.apply(is_phd)
df.head()
len(df), df.orcid_id.nunique(), df.country.nunique()
We have about 2 million affiliation observations from 741,867 ORCID profiles. Those affiliations are located in 241 countries, i.e. everywhere.
# save the affiliations data to local directory
df.to_csv('ORCID_migrations_2016_12_16.csv')
Next, let's create a data set that summarizes important information about each person.
This dataframe will be indexed by orcid_id: one row per person. And let's start by generating a phd_year column with the year in which they completed their PhD.
Note: The value of phd_year will be 2016 for ongoing PhD students.
df = pd.read_csv('ORCID_migrations_2016_12_16.csv', index_col = 0)
people = pd.DataFrame(index = df.orcid_id.unique())
people.index.name = 'orcid_id'
people["phd_year"] = df[(df.is_phd) &
                        (df.affiliation_type == "EDUCATION")
                        ].groupby("orcid_id").end_year.max()
people.head()
len(people)
There you have the year (if any) of PhD completion for each of the 740,000 ORCID profiles that include affiliations.
Now let's determine each person's country of residence in 2016.
today_countries = df[df.end_year == 2016].drop_duplicates(subset = 'orcid_id',
                                                          keep = 'first')
today_countries = today_countries.set_index("orcid_id").country
today_countries.name = "country_2016"
today_countries.head()
people = pd.merge(people, today_countries.to_frame(), how = 'left',
                  left_index = True, right_index = True)
people.head()
len(people), people.index.nunique()
Good. So now we can see where each person ended up in 2016, if they have an affiliation that ended in 2016 or is ongoing.
We don't know the true nationalities of people with ORCID profiles, but we can infer them from the country of their earliest affiliation. So let's find the year and country of each person's earliest affiliation.
people['earliest_year'] = df.groupby("orcid_id").start_year.min()
people.head()
earliest_affs = df[df.start_year == df.groupby('orcid_id').start_year.transform(min)]
earliest_affs = earliest_affs.drop_duplicates(subset = 'orcid_id',
                                              keep = 'first').set_index('orcid_id').country
earliest_affs.name = 'earliest_country'
people = pd.merge(people, earliest_affs.to_frame(), how = 'left',
                  right_index = True, left_index = True)
people.head()
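The earliest-affiliation selection above hinges on groupby/transform: transform('min') broadcasts each person's minimum start_year back onto every one of their rows, so a simple comparison keeps only the earliest affiliation(s). A minimal illustration on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"orcid_id":   ["A", "A", "B"],
                    "country":    ["US", "FR", "DE"],
                    "start_year": [2005, 2010, 2001]})

# Keep each person's earliest-starting affiliation(s)
earliest = toy[toy.start_year == toy.groupby("orcid_id").start_year.transform("min")]
```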
Good. Now let's mark people who obtained a PhD, and let's list the country in which they got their PhD.
people['has_phd'] = df.groupby('orcid_id').is_phd.max()
people.head()
phd_country = df[(df.affiliation_type == 'EDUCATION') & (df.is_phd)]
phd_country = phd_country.drop_duplicates(subset = 'orcid_id',
                                          keep = 'first').set_index('orcid_id').country
phd_country.name = 'phd_country'
people = pd.merge(people, phd_country.to_frame(), how = 'left',
                  right_index = True, left_index = True)
people.head()
Let's get some summary statistics based on these biographical data...
# total people who have both an earliest country and a 2016 country
len(people[(people.earliest_country.notnull()) &
           (people.country_2016.notnull())])
# total people in 2016 in a country different from their earliest country
len(people[(people.earliest_country.notnull()) &
           (people.country_2016.notnull()) &
           (people.country_2016 != people.earliest_country)])
So about 15% of people (75,000 / 500,000) live in a country in 2016 different from their earliest affiliation.
Let's look at the distribution of PhD degrees.
len(people), len(people[people.has_phd])
44% of the ORCID profiles that include any affiliations list a PhD degree.
Where did they get those PhD degrees?
phds = people[people.has_phd].phd_country.value_counts()
phds.name = "PhDs"
phds = phds.to_frame()
phds["% of total"] = round(phds.PhDs / len(people[people.has_phd]) * 100, 2)
phds.head(10)
So of the 330,000 people with a PhD, 22% got it in the US, 10% in the UK, 6% in Spain (doubtful), 5% in India, and 5% in China.
NOTE: There are clearly biases in the rate at which people from various countries have created ORCID profiles for themselves. Proceed with caution.
Save the people data to local directory. They are ready for play. Have fun!
people.to_csv('ORCID_migrations_2016_12_16_by_person.csv')