A tour of ORCID

The largest public data set of scientific careers

This Notebook provides a quick tour and analysis of a data set provided to John Bohannon by ORCID on 16 December 2016. Though ORCID was not intended as a massive longitudinal survey of the careers of the world's scientists, it is growing into one. To gain insights into this global population of knowledge-producers, you will need to account for biased sampling across countries and professional domains, demographic skew, and many quirks yet to be discovered. May this notebook guide you on your journey!

Science article:

Data and code:

In [1]:
import pandas as pd
import numpy as np
import sys
In [2]:
# Load the data
affs = pd.read_csv('ORCID_migrations_2016_12_16.csv', index_col = 0)
people = pd.read_csv('ORCID_migrations_2016_12_16_by_person.csv', index_col = 0)

In the affs dataframe, each row is a single affiliation. For a full description of the original XML data from ORCID, consult the docs.

In [43]:
affs.head()
Out[43]:
orcid_id country organization_name Ringgold_id start_year end_year affiliation_type affiliation_role is_phd
0 0000-0001-5000-0138 CO Universidad Del Rosario 25807 2014.0 2016.0 EMPLOYMENT Profesor Principal False
1 0000-0001-5000-0736 PT Faculty of Science and Technology, New Univers... NaN NaN NaN EMPLOYMENT PhD in Geology - Invited Assistant Professor True
2 0000-0001-5000-0736 PT Universidade de Évora Área Departamental de Ci... 98820 NaN 2006.0 EDUCATION PhD in Geology True
3 0000-0001-5000-0736 PT Universidade de Lisboa 37809 NaN 1997.0 EDUCATION MSc in Internal Geodynamics False
4 0000-0001-5000-0736 PT Universidade de Lisboa 37809 NaN 1992.0 EDUCATION Graduation in Geology False

In the people dataframe, each row is a single person. To reality-check the data for any given person, visit that person's public ORCID profile. For example, the first person who appears in people has this profile: http://orcid.org/0000-0001-5000-0138.

In [44]:
people.head()
Out[44]:
phd_year country_2016 earliest_year earliest_country has_phd phd_country has_migrated country_2013
orcid_id
0000-0001-5000-0138 NaN CO 2014.0 CO False NaN False NaN
0000-0001-5000-0736 2006.0 NaN NaN NaN True PT False NaN
0000-0001-5000-1018 2015.0 US 2005.0 US True US False US
0000-0001-5000-1181 NaN RU 1978.0 RU False NaN False RU
0000-0001-5000-1923 2016.0 GB 2004.0 GB True GB False GB

When did the people in ORCID do their PhD degrees?

In [143]:
%matplotlib inline

people[(people.has_phd) & 
       (people.phd_year > 1960) &
       (people.phd_year < 2017)].phd_year.hist(bins = 56)
Out[143]:
<matplotlib.axes._subplots.AxesSubplot at 0x114ef8c88>

This shows a quirk of how I coded the end_year for EDUCATION affiliations that were ongoing as of 2016. That spike includes not only people who completed their PhD in 2016 but also everyone with 1 year left to completion, as well as 2 years left, and so forth. To see what the distribution of annual PhD degrees conferred looks like, we have to truncate...

In [153]:
%matplotlib inline

people[(people.has_phd) & 
       (people.phd_year > 1960) &
       (people.phd_year < 2015)].phd_year.hist(bins = 2015 - 1960 - 1)
Out[153]:
<matplotlib.axes._subplots.AxesSubplot at 0x117f09f28>

That's reassuringly smooth. But it does show that ORCID is almost certainly skewed toward younger researchers, since the number of PhD degrees awarded each year is not really growing exponentially. Or at least I hope not.

Where have the people in ORCID been migrating?

In [120]:
def track_immigration(country, affs, people, start, end, has_phd = False,
                                                         foreign_phd = False,
                                                         foreign_earliest = True):
    ''' Track the annual migration of `people` to a single `country` 
        from `start` year through and including `end` year 
        based on affiliations dataframe `affs`. '''

    # Filter people
    if has_phd:
        # only include people with a PhD
        people = people[people.has_phd]
    if foreign_phd:
        # only include people who did their PhD outside of `country`
        people = people[people.phd_country != country]
    if foreign_earliest:
        # Only include people whose earliest affiliation is outside of `country`
        people = people[people.earliest_country != country]
    
    # Initialize a dict of migrations per year.
    data = dict([(year, 0) for year in range(start, end + 1)])        
    
    # now filter affs to only these people
    affs = affs[affs.orcid_id.isin(people.index)]
    
    # Keep track of total pool of existing foreigners in each year
    totals = dict()
    
    # Process the data year by year
    for year in range(start, end + 1):
        # how big is pool of foreigners who could migrate this year?
        totals[year] = affs[(affs.orcid_id.isin(people.index)) &
                            (affs.start_year <= year)]\
                           .orcid_id.nunique()
        # how many migrations to `country` happened this year?
        data[year] = affs[(affs.start_year == year)
                          & (affs.country == country)].orcid_id.nunique()
        
    # return migrations data and the dict of total foreigners per year to normalize
    return data, totals

def track_all_migrations(people, has_phd = True, phd_year_min = None, phd_year_max = None):
    ''' Track the migration of `people` from phd_country to country_2016. Optionally, 
        limit this analysis to cohort of PhD's obtained between `phd_year_min` and 
        `phd_year_max`. '''

    # Filter people
    if has_phd:
        # only include people with a PhD
        people = people[people.has_phd]
    if phd_year_min != None:
        # only include people who got PhD no earlier than phd_year_min
        people = people[people.phd_year >= phd_year_min]
    if phd_year_max != None:
        # only include people who got PhD no earlier than phd_year_min
        people = people[people.phd_year <= phd_year_max]
    # clean out people with no phd_country or country_2016
    people = people[(people.phd_country.notnull()) & (people.country_2016.notnull())]
    
    # initialize (countries)x(countries) dataframe with zeroes
    countries = pd.concat([people.phd_country, people.country_2016])\
                        .dropna().sort_values().drop_duplicates().tolist()
    data = pd.DataFrame([len(countries)*[0] for i in countries], 
                        index = countries, columns = countries)
    
    # process the data...
    # rows are phd_country, columns are country_2016
    for n, p in enumerate(people.itertuples()):
        # track progress
        if n % 500 == 0:
            sys.stdout.write("\r{0}".format(n))
            sys.stdout.flush()
        # update data
        data.loc[p[6], p[2]] += 1
    
    return data
In [132]:
data, totals = track_immigration('US', affs, people, 1995, 2014)
In [133]:
test = pd.Series(data).to_frame()
test.columns = ['migrations']
test['total_foreigners'] = pd.Series(totals)
test['normalized_migrations'] = test.migrations / test.total_foreigners
test.tail()
Out[133]:
migrations total_foreigners normalized_migrations
2010 3548 410049 0.008653
2011 3858 431795 0.008935
2012 4240 453013 0.009360
2013 4537 471753 0.009617
2014 4713 488241 0.009653
In [134]:
%matplotlib inline

test.migrations.plot()
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x1135506a0>

That's a suspicious slump after 2001 in the rate of immigration of foreign researchers into the US. If we normalize that absolute annual influx by the total number of foreigners who were available to immigrate each year, what does that look like?

In [135]:
test.normalized_migrations.plot()
Out[135]:
<matplotlib.axes._subplots.AxesSubplot at 0x113a5bda0>

So it looks like there was a 15% drop in the rate of immigration of foreign researchers into the US after 2001, but then it recovered by 2008 and grew to record levels over the Obama years.

To create the Sankey diagram in the Science article, let's generate a (phd_country)x(country_2016) matrix of all people who have a PhD...

In [29]:
data = track_all_migrations(people)
237000
In [30]:
data.head()
Out[30]:
AD AE AF AG AI AL AM AO AR AS ... VC VE VI VN VU WF YE ZA ZM ZW
AD 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AE 0 32 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AF 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AG 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AI 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 206 columns

In [32]:
data.to_csv('ORCID_PhD_country_to_2016_country.csv')

What sort of stories can be found in these data?

Let's look at the migration patterns of researchers from the "majority Muslim" nations who recently faced a US immigration ban by the Trump administration...

In [18]:
# Iran, Iraq, Syria, Sudan, Libya, Somalia, Yemen
muslim_nations = 'IR IQ SY SD LY SO YE'.split()

First get the orcid_id of everyone who has at least one muslim_nations affiliation.

In [20]:
musnat = affs[affs.country.isin(muslim_nations)]
musnatpeople = people[people.index.isin(musnat.orcid_id)]
len(musnatpeople)
Out[20]:
18268

How many of these 18,000 people have one of the muslim_nations as their country of earliest affiliation? And where are they from?

In [21]:
len(musnatpeople[musnatpeople.earliest_country.isin(muslim_nations)])
Out[21]:
12909
In [22]:
musnatpeople[musnatpeople.earliest_country.isin(muslim_nations)]\
    .earliest_country.value_counts().head()
Out[22]:
IR    10428
IQ     1414
SD      418
SY      315
LY      146
Name: earliest_country, dtype: int64

So these are about 90% Iranians.

How many have PhD's and where did they do their PhD's?

In [90]:
len(musnatpeople[musnatpeople.has_phd])
Out[90]:
8176
In [91]:
musnatpeople.phd_country.value_counts().head()
Out[91]:
IR    4552
GB     530
IQ     483
US     477
MY     367
Name: phd_country, dtype: int64

And where are all the muslim_nation people in 2016?

In [92]:
musnatpeople.country_2016.value_counts().head()
Out[92]:
IR    7097
IQ    1078
US     563
GB     214
SD     204
Name: country_2016, dtype: int64

Now let's look at the migrations of muslim_nation people to the US...

In [116]:
data, totals = track_migrations(df, musnatpeople, 'US', 1985)
test = pd.Series(data).to_frame()
test.columns = ['migrations']
test['total_foreigners'] = pd.Series(totals)
test['normalized_migrations'] = test.migrations / test.total_foreigners
test
Out[116]:
migrations total_foreigners normalized_migrations
1985 2 778 0.002571
1986 4 879 0.004551
1987 5 997 0.005015
1988 6 1130 0.005310
1989 3 1257 0.002387
1990 4 1444 0.002770
1991 8 1613 0.004960
1992 7 1841 0.003802
1993 6 2067 0.002903
1994 2 2329 0.000859
1995 4 2586 0.001547
1996 6 2848 0.002107
1997 3 3124 0.000960
1998 7 3530 0.001983
1999 10 3924 0.002548
2000 11 4424 0.002486
2001 17 4873 0.003489
2002 17 5362 0.003170
2003 12 5968 0.002011
2004 13 6559 0.001982
2005 17 7214 0.002357
2006 25 7908 0.003161
2007 34 8549 0.003977
2008 46 9270 0.004962
2009 69 9939 0.006942
2010 73 10616 0.006876
2011 114 11356 0.010039
2012 150 12102 0.012395
2013 153 12681 0.012065
2014 175 13199 0.013259
2015 134 13486 0.009936
In [117]:
%matplotlib inline

test.total_foreigners.plot()
Out[117]:
<matplotlib.axes._subplots.AxesSubplot at 0x128ed8a58>
In [118]:
test.migrations.plot()
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x135c35400>
In [119]:
test.normalized_migrations.plot()
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x13de7ce48>

The sample size is terribly small. But about 10% of muslim_nations people with PhD's migrated to the US in 2016, and that fraction has been growing steadily during the Obama years. You do see the same post-2001 slump that you see for all foreigner migrations to the US, and the same big recovery over the Obama years. But the data are very noisy.

In [28]:
data = track_all_migrations(musnatpeople)
data.head()
5500
Out[28]:
AE AF AM AT AU AZ BE BH BR BY ... SS SY TH TJ TR TW UA US YE ZA
AE 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AF 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AM 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AT 0 0 0 10 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
AU 1 0 0 0 137 0 0 0 0 0 ... 0 0 0 0 0 0 0 4 0 0

5 rows × 76 columns

In [29]:
data.to_csv('mostly_muslim_migrations.csv')

Let's just look at Iranians...

Anyone with any affiliation in Iran is likely to be an Iranian national. How many are there?

In [11]:
iranians = people[people.index.isin(affs[affs.country == "IR"].orcid_id)]
len(iranians)
Out[11]:
15001

15,000 of them. How many have a PhD?

In [12]:
len(iranians[iranians.has_phd])
Out[12]:
6733

Nearly half of them have a PhD. Where did they do their PhD?

In [13]:
iranians.phd_country.value_counts().head(10)
Out[13]:
IR    4552
US     414
GB     299
CA     230
AU     218
MY     211
SE      72
TR      60
DE      59
FR      54
Name: phd_country, dtype: int64

68% got their PhD in Iran, 6% in the US, and the rest elsewhere.

And where did the Iranians who got a PhD end up in 2016?

In [14]:
iranians[iranians.has_phd].country_2016.value_counts().head(10)
Out[14]:
IR    3583
US     364
AU     140
CA     134
GB      92
MY      59
DE      55
SE      51
TR      45
PT      38
Name: country_2016, dtype: int64

These are small numbers, but it does show how extremely migratory Iranian scientists are. Of the 6733 Iranians with a PhD in the ORCID data, only half (53%) are in Iran in 2016. About 5% are in the US and the rest are spread around the world.

Let's generate data for the Venn diagram in the Science article.

In [6]:
print('''
total people: {0}
has PhD: {1}
resides in US in 2016: {2}
'''.format(len(people), len(people[people.has_phd]), 
           len(people[people.phd_country == 'US']), 
           len(people[people.country_2016 == 'US'])))
total people: 741867
has PhD: 329291
resides in US in 2016: 71031

Let's add a variable to people called has_migrated. If >=2 years in different country or ongoing in new country, then has_migrated == True.

In [94]:
affs['duration'] = affs.end_year - affs.start_year
affs.head()
Out[94]:
orcid_id country organization_name Ringgold_id start_year end_year affiliation_type affiliation_role is_phd duration
0 0000-0001-5000-0138 CO Universidad Del Rosario 25807 2014.0 2016.0 EMPLOYMENT Profesor Principal False 2.0
1 0000-0001-5000-0736 PT Faculty of Science and Technology, New Univers... NaN NaN NaN EMPLOYMENT PhD in Geology - Invited Assistant Professor True NaN
2 0000-0001-5000-0736 PT Universidade de Évora Área Departamental de Ci... 98820 NaN 2006.0 EDUCATION PhD in Geology True NaN
3 0000-0001-5000-0736 PT Universidade de Lisboa 37809 NaN 1997.0 EDUCATION MSc in Internal Geodynamics False NaN
4 0000-0001-5000-0736 PT Universidade de Lisboa 37809 NaN 1992.0 EDUCATION Graduation in Geology False NaN

For the purpose of this analysis, we should code EDUCATION affiliations with no start_year and no end_year as duration == 2. Doing a degree in a country is definitely a real residency. So it should count for migration...

In [95]:
def fix_education_duration(row):
    if row.affiliation_type == 'EDUCATION' and np.isnan(row.duration):
        return 2
    else:
        return row.duration
        
affs_duration = affs.apply(lambda x: fix_education_duration(x), axis = 1)
affs_duration.head()
Out[95]:
0    2.0
1    NaN
2    2.0
3    2.0
4    2.0
dtype: float64
In [96]:
affs.duration = affs_duration
affs[affs.duration > 1].head()
Out[96]:
orcid_id country organization_name Ringgold_id start_year end_year affiliation_type affiliation_role is_phd duration
0 0000-0001-5000-0138 CO Universidad Del Rosario 25807 2014.0 2016.0 EMPLOYMENT Profesor Principal False 2.0
2 0000-0001-5000-0736 PT Universidade de Évora Área Departamental de Ci... 98820 NaN 2006.0 EDUCATION PhD in Geology True 2.0
3 0000-0001-5000-0736 PT Universidade de Lisboa 37809 NaN 1997.0 EDUCATION MSc in Internal Geodynamics False 2.0
4 0000-0001-5000-0736 PT Universidade de Lisboa 37809 NaN 1992.0 EDUCATION Graduation in Geology False 2.0
7 0000-0001-5000-1018 US Purdue University 311308 2012.0 2015.0 EDUCATION Ph.D. Electrical and Computer Engineering True 3.0
In [97]:
migration = affs[affs.duration >= 2].groupby('orcid_id').country.nunique()
migration[migration > 1].head()
Out[97]:
orcid_id
0000-0001-5000-223X    2
0000-0001-5000-4542    2
0000-0001-5000-5369    2
0000-0001-5000-8017    2
0000-0001-5001-2171    2
Name: country, dtype: int64
In [98]:
migration.name = 'has_migrated'
In [99]:
def detect_migration(x):
    if x >= 2:
        return True
    else:
        return False
    
migration = migration.apply(detect_migration)
migration[migration == True].head()
Out[99]:
orcid_id
0000-0001-5000-223X    True
0000-0001-5000-4542    True
0000-0001-5000-5369    True
0000-0001-5000-8017    True
0000-0001-5001-2171    True
Name: has_migrated, dtype: bool
In [100]:
migration.head()
Out[100]:
orcid_id
0000-0001-5000-0138    False
0000-0001-5000-0736    False
0000-0001-5000-1018    False
0000-0001-5000-1181    False
0000-0001-5000-1923    False
Name: has_migrated, dtype: bool
In [101]:
people = pd.merge(people, migration.to_frame(), how = 'left', left_index = True, right_index = True)
people.head()
Out[101]:
phd_year country_2016 earliest_year earliest_country has_phd phd_country has_migrated
orcid_id
0000-0001-5000-0138 NaN CO 2014.0 CO False NaN False
0000-0001-5000-0736 2006.0 NaN NaN NaN True PT False
0000-0001-5000-1018 2015.0 US 2005.0 US True US False
0000-0001-5000-1181 NaN RU 1978.0 RU False NaN False
0000-0001-5000-1923 2016.0 GB 2004.0 GB True GB False
In [102]:
people.has_migrated = people.has_migrated.fillna(False)
In [103]:
len(people), len(people[people.has_migrated])
Out[103]:
(741867, 111149)
In [104]:
print('''
people: {0}
has PhD: {1}
has migrated: {2}
in US today: {3}
'''.format(len(people), 
           len(people[people.has_phd]), 
           len(people[people.has_migrated]),
           len(people[people.country_2016 == 'US'])))
people: 741867
has PhD: 329291
has migrated: 111149
in US today: 88930

In [105]:
print('''
has PhD and has migrated: {0}
has PhD and in US today: {1}
has migrated and in US today: {2}
has PhD and has migrated and in US today: {3}
'''.format(
len(people[(people.has_phd) & (people.has_migrated)]),
len(people[(people.has_phd) & (people.country_2016 == 'US')]),
len(people[(people.has_migrated) & (people.country_2016 == 'US')]),
len(people[(people.has_phd) & (people.has_migrated) & (people.country_2016 == 'US')])))
has PhD and has migrated: 84389
has PhD and in US today: 52499
has migrated and in US today: 17047
has PhD and has migrated and in US today: 13295

In [106]:
# might be useful later, so let's save this new people column
people.to_csv('ORCID_migrations_2016_12_16_by_person.csv')

Do the broad demographic patterns in ORCID match other population studies of scientists?

Let's try to replicate the "Foreign Fractions" analysis from the 2012 Nature news story based on the GlobSci survey.

In [8]:
print('''
in Switzerland in 2016: {0}
in Switzerland in 2016 and has migrated: {1}
in Switzerland in 2016 and earliest country and PhD country not Switzerland: {2}
'''.format(len(people[(people.country_2016 == 'CH')]),
           len(people[(people.country_2016 == 'CH') & (people.has_migrated)]),
           len(people[(people.country_2016 == 'CH') & (people.earliest_country != 'CH') &
                      (people.phd_country != 'CH')])))
in Switzerland in 2016: 3438
in Switzerland in 2016 and has migrated: 1602
in Switzerland in 2016 and earliest country and PhD country not Switzerland: 1358

We get a similar fraction of foreigners in Switzerland (1358/3438 = 40%). Nature cited 57%.

In [12]:
def foreign_fraction(country):
    foreigners = people[(people.country_2016 == country) & 
           (people.earliest_country != country) & 
           (people.phd_country != country)]
    return foreigners.earliest_country.value_counts().head()
In [13]:
foreign_fraction('CH')
Out[13]:
DE    250
FR    155
IT    153
US    146
GB    105
Name: earliest_country, dtype: int64

Nice. We also get Germany as the largest source of foreigner researchers in Switzerland.

In [14]:
foreign_fraction('GB')
Out[14]:
US    682
IT    426
ES    392
DE    348
FR    294
Name: earliest_country, dtype: int64

We also get Germans and Italians as top foreign fractions for the UK but our sample is enriched for US probably because a large portion of Brits went to the US for a PhD and came home.

In [15]:
foreign_fraction('AU')
Out[15]:
GB    669
US    443
CN    248
NZ    163
CA    133
Name: earliest_country, dtype: int64

We also get the UK ('GB') and China as top fractions of foreigners in Australia.

In [16]:
foreign_fraction('SE')
Out[16]:
DE    139
US    127
GB    126
IT     88
CN     69
Name: earliest_country, dtype: int64

We also get Germany as the top fraction for Sweden.

In [18]:
foreign_fraction('FR')
Out[18]:
IT    178
US    128
ES    125
GB    123
DE     89
Name: earliest_country, dtype: int64

And indeed we also get Italians as the top fraction for France. This looks great.

In [20]:
foreign_fraction('CN')
Out[20]:
US    392
GB    155
JP    118
CA     63
AU     60
Name: earliest_country, dtype: int64

The 2012 GlobSci survey didn't include China, but here's what that looks like. It's most likely all Chinese nationals here who did degrees abroad and returned home.

In [22]:
foreign_fraction('US')
Out[22]:
CN    1852
IN    1337
GB     951
CA     915
DE     474
Name: earliest_country, dtype: int64
In [28]:
US_foreigners = len(people[(people.country_2016 == 'US') & (people.has_migrated)])
print('Chinese researchers in US (ORCID):', 1852 / US_foreigners)
print('Indian researchers in US (ORCID):', 1337 / US_foreigners)
Chinese researchers in US (ORCID): 0.10864081656596468
Indian researchers in US (ORCID): 0.07843022232650906
In [32]:
print('China/India ratio from GlobSci:', 16.9 / 12.3)
print('China/India ratio from ORCID:  ', 1852 / 1337)
China/India ratio from GlobSci: 1.3739837398373982
China/India ratio from ORCID:   1.3851907255048617
In [31]:
foreign_fraction('US').sum() / len(people[people.country_2016 == 'US'])
Out[31]:
0.062172495220960307

ORCID has nearly identical results as GlobSci for distribution of nationalities of researchers in US. It's just that total number of 'foreigners' that we can detect based on affiliations is much lower (6.2% compared to GlobSci's 38.4%).

Let's see if we can replicate a statistic from the 2015 NSF report "Profile of Early Career Doctorates"

"Of the estimated 43,300 foreign-trained early career doctorates, almost one-half earned their first doctoral degree from academic institutions in Canada, China, England, India, and Germany (figure 1)."

In [58]:
# recent career foreign-trained scientists in US
len(people[(people.country_2016 == 'US') & 
           (people.phd_country != 'US') &
           (people.phd_year > 2010)])
Out[58]:
3314
In [60]:
# top PhD country of these foreign-trained scientists
people[(people.country_2016 == 'US') & 
       (people.phd_country != 'US') & 
       (people.phd_year > 2010)].phd_country.value_counts().head()
Out[60]:
CN    494
GB    366
CA    339
IN    242
DE    192
Name: phd_country, dtype: int64
In [61]:
# total scientists from those top-5 countries
people[(people.country_2016 == 'US') & 
       (people.phd_country != 'US') & 
       (people.phd_year > 2010)].phd_country.value_counts().head().sum()
Out[61]:
1633
In [72]:
# proportion of those top-5 compared to all foreign-trained early career scientists in US
"{:.2f}".format(1633/3314)
Out[72]:
'0.49'

Nailed it. The ORCID profiles also show about "one-half" of foreign-trained early career scientists in the US got their PhD in "Canada, China, England, India, and Germany".

Let's try to replicate the global distribution of researchers from the UNESCO "Science Report, Towards 2030"

"The Big Five (China, European Union, Japan, Russian Federation and USA) still account for 72% of researchers worldwide but the share of China has progressed considerably since 2009, to the detriment of Japan, the Russian Federation and the USA. The share of the European Union (7.1% of the global population) has remained stable, at 22.2% in 2013, compared to 22.5% in 2009. Europe as a whole (11.4% of the global population) hosts 31% of the world’s researchers."

In [20]:
eu = ['AT', 'BE', 'BG', 'GB', 'HR', 'CY', 'CZ', 
      'DK', 'EE', 'ES', 'FI', 'FR', 'DE', 'GR', 
      'HU', 'IE', 'IT', 'LV', 'LT', 'LU', 'MT',
      'NL', 'PL', 'PT', 'RO', 'SK', 'ES', 'SE']
In [86]:
len(people[people.country_2016.isin(eu)]) / len(people)
Out[86]:
0.20625529913043714

Not bad! We see 21% of the global population of scientists in EU countries, just 1% shy of the UN figures.

In [87]:
len(people[people.country_2016 == 'CN']) / len(people)
Out[87]:
0.03448731376378785

Clearly China is highly underrepresented (by about 16%). So is the 21% EU a coincidence? Let's see if we get close to 72% for the "Big Five".

In [88]:
big_5 = eu + ['US', 'CN', 'JP', 'RU']
len(people[people.country_2016.isin(big_5)]) / len(people)
Out[88]:
0.38967766459486675

Nope. The Big Five are underrepresented by about 35%.

In [101]:
people.country_2016.value_counts().head(20) / len(people)
Out[101]:
US    0.119873
BR    0.044120
GB    0.043707
IN    0.036740
CN    0.034487
ES    0.033914
IT    0.026483
AU    0.022309
RU    0.019443
PT    0.019021
DE    0.014097
KR    0.013153
FR    0.012948
UA    0.012671
CA    0.012556
SE    0.011803
CO    0.010831
MX    0.009778
JP    0.009619
IR    0.009566
Name: country_2016, dtype: float64

UNESCO has the US at 16% of the global pool and we get 12%. Russia and Japan are supposed to have more like 6% and 8%. So there's 15% missing which, along with the 16% missing from China, accounts for the underrepresentation of the Big Five.

But the UNESCO stats are for the year 2013. Let's look at the global distribution of ORCID people in 2013...

In [13]:
# We only want affiliations that started on or before 2013
# and ended on or later than 2013
country_2013 = affs[(affs.start_year <= 2013) & (affs.end_year >= 2013)]

len(affs), len(country_2013), country_2013.orcid_id.nunique()
Out[13]:
(1988114, 622205, 486260)

We have 0.5 million people with 0.6 million affiliations in 2013. Are any of those 2013 affiliations ongoing as of 2016?

In [14]:
len(country_2013[country_2013.end_year.isnull()])
Out[14]:
0

Nope! All accounted for. Great, so now let's find where people were in 2013...

In [16]:
country_2013 = country_2013.sort_values(by = ['orcid_id', 
                                              'end_year']).drop_duplicates(subset = 'orcid_id', 
                                                                           keep = 'last')
country_2013.head()
Out[16]:
orcid_id country organization_name Ringgold_id start_year end_year affiliation_type affiliation_role is_phd
8 0000-0001-5000-1018 US Purdue University 311308 2010.0 2015.0 EMPLOYMENT Research Assistant False
12 0000-0001-5000-1181 RU Дальневосточная государственная академия физич... NaN 1982.0 2016.0 EMPLOYMENT профессор False
15 0000-0001-5000-1923 GB Northumbria University 5995 2013.0 2014.0 EDUCATION MSc Sport and Exercise Psychology False
17 0000-0001-5000-223X GB University of Liverpool 4591 2012.0 2016.0 EMPLOYMENT Senior Lecturer in Plant Metabolism False
34 0000-0001-5000-3822 CA University of Calgary NaN 2013.0 2016.0 EMPLOYMENT Teaching Assistant False
In [17]:
country_2013 = country_2013.set_index('orcid_id').country
In [18]:
country_2013.name = 'country_2013'
In [19]:
people['country_2013'] = country_2013
people.head()
Out[19]:
phd_year country_2016 earliest_year earliest_country has_phd phd_country has_migrated country_2013
orcid_id
0000-0001-5000-0138 NaN CO 2014.0 CO False NaN False NaN
0000-0001-5000-0736 2006.0 NaN NaN NaN True PT False NaN
0000-0001-5000-1018 2015.0 US 2005.0 US True US False US
0000-0001-5000-1181 NaN RU 1978.0 RU False NaN False RU
0000-0001-5000-1923 2016.0 GB 2004.0 GB True GB False GB

So the country distributions don't change much between 2013 and 2016 in ORCID. The big bias seems to be that researchers based in China and Russia are under-represented. But on the whole, the sampling of the world's researchers by ORCID and the UNESCO survey seem broadly similar. And as more scientists add their affiliation data to their public ORCID profiles, the ORCID data set will paint an increasingly accurate picture of this population.