This notebook processes the 2.8 million profiles in the ORCID public data file (16 December 2016) into a data set of the 741,867 profiles that list at least one education or employment affiliation. It outputs two files:

  • ORCID_migrations_2016_12_16.csv (222 MB)
  • ORCID_migrations_2016_12_16_by_person.csv (31 MB)

ORCID provided a compressed tarball containing the data in both XML and JSON formats. I needed about 300 GB of free space to decompress the data and work with it. If you don't have that much free space on your computer, an external SSD works well.
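If you want to script the decompression, here is a minimal sketch using Python's tarfile module. The archive filename is an assumption; substitute whatever your download is actually called.

import tarfile

# hypothetical archive name -- use the actual tarball ORCID gave you
with tarfile.open("ORCID_public_data_file_2016_12_16.tar.gz", "r:gz") as tar:
    tar.extractall("/Volumes/playground/ORCID/public_profiles")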

The following code takes the JSON version of the decompressed ORCID data as input. It is written in Python 3.4.5.

John Bohannon, Science Magazine
April 2017

In [1]:
import json, os, sys
import pandas as pd

def file_generator(json_dir):
    ''' Using a generator allows pausing and restarting
    without having to figure out where you left off. '''
    for n, i in enumerate(os.listdir(json_dir)):
        yield (n, os.path.join(json_dir, i))

def get_profiles(data, json_files, stop = None):
    ''' Iterate over JSON files and process them '''
    for n, filepath in json_files:
        # keep track of progress
        sys.stdout.write('\r{}'.format(n))
        sys.stdout.flush()
        # terminate if stop is specified and reached
        if stop is not None and n >= stop:
            return
        # process this JSON file and harvest the data
        if filepath.endswith(".json"):
            with open(filepath) as f:
                profile = json.load(f)
                for row in get_affiliations(profile):
                    data.append(row)

def has_affiliation(profile):
    ''' Test whether the profile lists any affiliations '''
    try:
        return bool(profile["orcid-profile"]["orcid-activities"]["affiliations"]["affiliation"])
    except (KeyError, TypeError):
        return False

def get_nested(d, *keys):
    ''' Safely walk a chain of nested dictionary keys,
        returning None if any key is missing along the way. '''
    for key in keys:
        try:
            d = d[key]
        except (KeyError, TypeError):
            return None
    return d

def get_affiliations(profile):
    ''' For each profile, extract all affiliations and metadata '''
    profile_data = []
    if has_affiliation(profile):
        orcid_id = profile["orcid-profile"]["orcid-identifier"]["path"]
        for aff in profile["orcid-profile"]["orcid-activities"]["affiliations"]["affiliation"]:
            row = [orcid_id,
                   aff["organization"]["address"]["country"],
                   get_nested(aff, "organization", "name"),
                   get_nested(aff, "organization", "disambiguated-organization",
                              "disambiguated-organization-identifier"),
                   get_nested(aff, "start-date", "year", "value"),
                   get_nested(aff, "end-date", "year", "value"),
                   get_nested(aff, "type"),
                   get_nested(aff, "role-title")]
            profile_data.append(row)
    return profile_data
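
For reference, the extraction code above assumes each profile JSON file has roughly this nested shape (the values here are illustrative, reconstructed from the key paths used above, not copied from the data):

example_profile = {
    "orcid-profile": {
        "orcid-identifier": {"path": "0000-0001-5000-0138"},
        "orcid-activities": {
            "affiliations": {
                "affiliation": [{
                    "organization": {
                        "name": "Universidad Del Rosario",
                        "address": {"country": "CO"},
                        "disambiguated-organization": {
                            "disambiguated-organization-identifier": "25807"}},
                    "start-date": {"year": {"value": "2014"}},
                    "end-date": None,
                    "type": "EMPLOYMENT",
                    "role-title": "Profesor Principal"}]}}}}

get_affiliations(example_profile)
# [['0000-0001-5000-0138', 'CO', 'Universidad Del Rosario', '25807',
#   '2014', None, 'EMPLOYMENT', 'Profesor Principal']]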
In [22]:
json_dir = "/Volumes/playground/ORCID/public_profiles/json"
json_files = file_generator(json_dir)
In [23]:
data = []
In [24]:
%%time
get_profiles(data, json_files)
2814361
CPU times: user 1h 24min 54s, sys: 30min 19s, total: 1h 55min 14s
Wall time: 1h 47min 3s

This took nearly 2 hours. Be patient.

There are 2.8 million ORCID profiles. Let's see how many of those profiles include affiliations data...

In [26]:
df = pd.DataFrame(data, columns = ["orcid_id", "country", "organization_name", 
                              "Ringgold_id", "start_year", "end_year", 
                              "affiliation_type", "affiliation_role"])
df.head()
Out[26]:
orcid_id country organization_name Ringgold_id start_year end_year affiliation_type affiliation_role
0 0000-0001-5000-0138 CO Universidad Del Rosario 25807 2014 None EMPLOYMENT Profesor Principal
1 0000-0001-5000-0736 PT Faculty of Science and Technology, New Univers... None None None EMPLOYMENT PhD in Geology - Invited Assistant Professor
2 0000-0001-5000-0736 PT Universidade de Évora Área Departamental de Ci... 98820 None 2006 EDUCATION PhD in Geology
3 0000-0001-5000-0736 PT Universidade de Lisboa 37809 None 1997 EDUCATION MSc in Internal Geodynamics
4 0000-0001-5000-0736 PT Universidade de Lisboa 37809 None 1992 EDUCATION Graduation in Geology
In [29]:
df.orcid_id.nunique(), len(df)
Out[29]:
(753238, 2048627)

There are 753,238 researchers with ORCID profiles that include affiliations data, with a total of about 2 million affiliations over the course of their careers.

In [34]:
education_without_dates = df[((df.affiliation_type == "EDUCATION") & 
                       (df.start_year.isnull()) & 
                       (df.end_year.isnull()))]
len(education_without_dates), education_without_dates.orcid_id.nunique()
Out[34]:
(114096, 96143)

NOTE: There are 96,143 profiles with EDUCATION affiliations that have neither a start_year nor an end_year. These researchers list their education affiliations without noting the year they obtained the degree.

These data stop in December 2016, so affiliations that have a start_year but no end_year are ongoing affiliations. How many are there?

In [35]:
ongoing = df[((df.start_year.notnull()) & (df.end_year.isnull()))]
len(ongoing), ongoing.orcid_id.nunique()
Out[35]:
(588628, 489928)

There are about half a million people with ongoing affiliations as of 2016. Let's recode those missing end_year values of ongoing affiliations as 2016.

In [45]:
def recode_ongoing(row):
    ''' Recode the end_year value of ongoing affiliations to 2016,
        since the data stop in December 2016. '''
    # pd.notnull/pd.isnull handle missing values whether they are None or NaN
    if pd.notnull(row.start_year) and pd.isnull(row.end_year):
        return 2016
    else:
        return row.end_year
    
df.end_year = df.apply(recode_ongoing, axis = 1)
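
Note: the row-wise apply above is slow on 2 million rows. An equivalent vectorized recode (an alternative to the apply, not part of the original run) finishes in seconds:

# set end_year to 2016 wherever an affiliation has a start_year but no end_year
ongoing_mask = df.start_year.notnull() & df.end_year.isnull()
df.loc[ongoing_mask, 'end_year'] = 2016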

Next, we should exclude people whose country of residence in 2016 is ambiguous, i.e. those with 2016 affiliations in more than one country. How many are there?

In [49]:
today_affs = df[df.end_year == 2016]
# count each person's 2016 affiliations per country; anyone appearing
# under more than one country is ambiguous
g = today_affs.groupby(["orcid_id", "country"]).size().reset_index()
multi_2016_country_people = set(g[g.duplicated("orcid_id")].orcid_id)

len(multi_2016_country_people)
Out[49]:
11371

There are about 11,000 of these ambiguous people. Let's exclude them, since for the migration analysis we don't know where they ended up as of the end of 2016.

In [52]:
df = df[~df.orcid_id.isin(multi_2016_country_people)]
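
An optional sanity check: after the exclusion, every remaining person should have at most one country among their 2016 affiliations.

# each remaining person's 2016 affiliations should span at most one country
check = df[df.end_year == 2016].groupby('orcid_id').country.nunique()
assert (check <= 1).all()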

Next, we want to know which of these people got PhD degrees. So let's classify the affiliation_role of the EDUCATION affiliations as being PhD degrees or not, just using simple string matching. I built this classifier through trial and error on these data. I have not yet found an exception, but there are probably some in there. The error rate should be very low.

In [138]:
def is_phd(role):
    ''' After lowercasing the affiliation_role string, look for
        terms that indicate it is a Ph.D. degree.
        Yes, regular expressions are tidier, but also unreadable 
        to most people and harder to debug. '''
    # These are international synonyms for the Ph.D. degree
    synonyms = ("phd", "ph.d", "dphil", "d.phil", "rer. nat", 
                "rer, nat", "doctor rerum", "doktor rerum")
    # This catches things like "Doctorate en Chimie" but
    # excludes "Postdoctoral Fellow" and "Medical Doctorate"
    special_cases_allowed = ("doctor", "doktor")
    special_cases_disallowed = ("pre", "post", "med")
    if isinstance(role, str):
        # lowercase the string
        role = role.lower()
        # Look for Ph.D. synonyms
        if any(i in role for i in synonyms):
            return True
        # Look for special cases
        if any(i in role for i in special_cases_allowed) and \
           not any(i in role for i in special_cases_disallowed):
            return True
    # Otherwise call it False
    return False

df["is_phd"] = df.affiliation_role.apply(is_phd)
df.head()
Out[138]:
orcid_id country organization_name Ringgold_id start_year end_year affiliation_type affiliation_role is_phd
0 0000-0001-5000-0138 CO Universidad Del Rosario 25807 2014 2016 EMPLOYMENT Profesor Principal False
1 0000-0001-5000-0736 PT Faculty of Science and Technology, New Univers... None None None EMPLOYMENT PhD in Geology - Invited Assistant Professor True
2 0000-0001-5000-0736 PT Universidade de Évora Área Departamental de Ci... 98820 None 2006 EDUCATION PhD in Geology True
3 0000-0001-5000-0736 PT Universidade de Lisboa 37809 None 1997 EDUCATION MSc in Internal Geodynamics False
4 0000-0001-5000-0736 PT Universidade de Lisboa 37809 None 1992 EDUCATION Graduation in Geology False
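
A few spot checks of the classifier (illustrative strings, not drawn from the data):

assert is_phd("PhD in Geology")
assert is_phd("Doctorate en Chimie")       # allowed special case
assert not is_phd("Postdoctoral Fellow")   # "post" is disallowed
assert not is_phd("Medical Doctorate")     # "med" is disallowed
assert not is_phd(None)                    # non-strings are always False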
In [155]:
len(df), df.orcid_id.nunique(), df.country.nunique()
Out[155]:
(1988114, 741867, 241)

We have about 2 million affiliation observations from 741,867 ORCID profiles. Those affiliations are located in 241 countries and territories, i.e. essentially everywhere.

In [152]:
# save the affiliations data to local directory
df.to_csv('ORCID_migrations_2016_12_16.csv')

Next, let's create a data set that summarizes important information about each person.

This dataframe will be indexed by orcid_id: one row per person. And let's start by generating a phd_year column with the year in which they completed their PhD.

Note: The value of phd_year will be 2016 for ongoing PhD students.

In [2]:
df = pd.read_csv('ORCID_migrations_2016_12_16.csv', index_col = 0)
In [3]:
people = pd.DataFrame(index = df.orcid_id.unique())
people.index.name = 'orcid_id'
people["phd_year"] = df[(df.is_phd) 
                        & (df.affiliation_type == "EDUCATION")
                       ].groupby("orcid_id").end_year.max()
people.head()
Out[3]:
phd_year
orcid_id
0000-0001-5000-0138 NaN
0000-0001-5000-0736 2006.0
0000-0001-5000-1018 2015.0
0000-0001-5000-1181 NaN
0000-0001-5000-1923 2016.0
In [4]:
len(people)
Out[4]:
741867

There you have the year (if any) of PhD completion for each of the 741,867 ORCID profiles that include affiliations.

Now let's determine each person's country of residence in 2016.

In [5]:
# keep='first' is safe here: people with 2016 affiliations in more
# than one country were excluded above
today_countries = df[df.end_year == 2016].drop_duplicates(subset = 'orcid_id', 
                                                  keep = 'first')
today_countries = today_countries.set_index("orcid_id").country
today_countries.name = "country_2016"
today_countries.head()
Out[5]:
orcid_id
0000-0001-5000-0138    CO
0000-0001-5000-1018    US
0000-0001-5000-1181    RU
0000-0001-5000-1923    GB
0000-0001-5000-223X    GB
Name: country_2016, dtype: object
In [7]:
people = pd.merge(people, today_countries.to_frame(), how = 'left',
                  left_index = True, right_index = True)
people.head()
Out[7]:
phd_year country_2016
orcid_id
0000-0001-5000-0138 NaN CO
0000-0001-5000-0736 2006.0 NaN
0000-0001-5000-1018 2015.0 US
0000-0001-5000-1181 NaN RU
0000-0001-5000-1923 2016.0 GB
In [8]:
len(people), people.index.nunique()
Out[8]:
(741867, 741867)

Good. So now we can see where each person ended up in 2016, if they have an affiliation that ended in 2016 or is ongoing.

We don't know the true nationalities of people with ORCID profiles, but we can infer them from the country of their earliest affiliation. So let's find the year and country of each person's earliest affiliation.

In [9]:
people['earliest_year'] = df.groupby("orcid_id").start_year.min()
people.head()
Out[9]:
phd_year country_2016 earliest_year
orcid_id
0000-0001-5000-0138 NaN CO 2014.0
0000-0001-5000-0736 2006.0 NaN NaN
0000-0001-5000-1018 2015.0 US 2005.0
0000-0001-5000-1181 NaN RU 1978.0
0000-0001-5000-1923 2016.0 GB 2004.0
In [10]:
earliest_affs = df[df.start_year == df.groupby('orcid_id').start_year.transform(min)]
# if a person has earliest affiliations in different countries in the
# same year, keep='first' picks one arbitrarily
earliest_affs = earliest_affs.drop_duplicates(subset = 'orcid_id', 
                                      keep = 'first').set_index('orcid_id').country
earliest_affs.name = 'earliest_country'
people = pd.merge(people, earliest_affs.to_frame(), how = 'left',
                  right_index = True, left_index = True)
people.head()
Out[10]:
phd_year country_2016 earliest_year earliest_country
orcid_id
0000-0001-5000-0138 NaN CO 2014.0 CO
0000-0001-5000-0736 2006.0 NaN NaN NaN
0000-0001-5000-1018 2015.0 US 2005.0 US
0000-0001-5000-1181 NaN RU 1978.0 RU
0000-0001-5000-1923 2016.0 GB 2004.0 GB

Good. Now let's mark people who obtained a PhD, and let's list the country in which they got their PhD.

In [12]:
people['has_phd'] = df.groupby('orcid_id').is_phd.max()
people.head()
Out[12]:
phd_year country_2016 earliest_year earliest_country has_phd
orcid_id
0000-0001-5000-0138 NaN CO 2014.0 CO False
0000-0001-5000-0736 2006.0 NaN NaN NaN True
0000-0001-5000-1018 2015.0 US 2005.0 US True
0000-0001-5000-1181 NaN RU 1978.0 RU False
0000-0001-5000-1923 2016.0 GB 2004.0 GB True
In [13]:
phd_country = df[(df.affiliation_type == 'EDUCATION') & (df.is_phd)]
# keep='first' picks one arbitrarily if someone lists PhD affiliations
# in more than one country
phd_country = phd_country.drop_duplicates(subset = 'orcid_id', 
                                      keep = 'first').set_index('orcid_id').country
phd_country.name = 'phd_country'
people = pd.merge(people, phd_country.to_frame(), how = 'left',
                  right_index = True, left_index = True)
people.head()
Out[13]:
phd_year country_2016 earliest_year earliest_country has_phd phd_country
orcid_id
0000-0001-5000-0138 NaN CO 2014.0 CO False NaN
0000-0001-5000-0736 2006.0 NaN NaN NaN True PT
0000-0001-5000-1018 2015.0 US 2005.0 US True US
0000-0001-5000-1181 NaN RU 1978.0 RU False NaN
0000-0001-5000-1923 2016.0 GB 2004.0 GB True GB
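
As a consistency check, anyone with a recorded phd_year should also be flagged has_phd. (The converse need not hold: a PhD-like role on an EMPLOYMENT affiliation sets has_phd but not phd_year, which comes only from EDUCATION affiliations.)

# everyone with a phd_year must also have has_phd == True
assert people[people.phd_year.notnull()].has_phd.all()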

Let's get some summary statistics based on these biographical data...

In [36]:
# total people who have both an earliest country and a 2016 country
len(people[(people.earliest_country.notnull()) & 
           (people.country_2016.notnull())])
Out[36]:
499670
In [37]:
# total people in 2016 in a country different from their earliest country
len(people[(people.earliest_country.notnull()) & 
           (people.country_2016.notnull()) &
           (people.country_2016 != people.earliest_country)])
Out[37]:
75450

So about 15% of people (75,450 of 499,670) were living in 2016 in a country different from that of their earliest affiliation.
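
The same ratio, computed directly from the people table:

both = people.earliest_country.notnull() & people.country_2016.notnull()
moved = both & (people.country_2016 != people.earliest_country)
moved.sum() / both.sum()   # ≈ 0.15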

Let's look at the distribution of PhD degrees.

In [26]:
len(people), len(people[people.has_phd])
Out[26]:
(741867, 329291)

44% of the ORCID profiles that include any affiliations list a PhD degree.

Where did they get those PhD degrees?

In [35]:
phds = people[people.has_phd].phd_country.value_counts()
phds.name = "PhDs"
phds = phds.to_frame()
phds["% of total"] = round(phds.PhDs / len(people[people.has_phd]) * 100, 2)
phds.head(10)
Out[35]:
PhDs % of total
US 71031 21.57
GB 31294 9.50
ES 21099 6.41
IN 17444 5.30
CN 16221 4.93
AU 13584 4.13
IT 11557 3.51
BR 9862 2.99
PT 8765 2.66
FR 8515 2.59

So of the 330,000 people with a PhD, 22% got it in the US, 10% in the UK, 6% in Spain (doubtful), 5% in India, and 5% in China.

NOTE: There are clearly biases in the rate at which researchers from various countries have created ORCID profiles for themselves. Proceed with caution.

Save the people data to the local directory. They are ready for play. Have fun!

In [39]:
people.to_csv('ORCID_migrations_2016_12_16_by_person.csv')