Data-Related Practices in Psychology

Notebook Contents

  • In general, this notebook follows the structure of the our survey instrument. The full text of the survey- as it was presented to study participants- is included as Borghi_VanGulick_PsychRDM_Survey.pdf.

  • The CSV file containing the data used in this notebook is named Borghi_VanGulick_PsychRDM_Data.csv and is accompanied by a data dictionary (Borghi_VanGulick_PsychRDM_Dictionary.csv).


  1. General Information - Questions 1 through 16 of the survey.
  2. Data Collection - Questions 17 through 32.
  3. Data Analysis - Questions 33 through 45.
  4. Data Sharing - Questions 46 through 59.
  5. Emerging Practices - Questions 60 through 64.
  6. Comparisons and Analyses - Comparisons/stats
  7. Figures - Figures

Import packages and set defaults

In [1]:
#Import the packages needed to make this notebook work

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

#Make sure the figures appear in the notebook

%matplotlib inline
In [2]:
#Import the data

df = pd.read_csv("downloads/Data_cleaned.csv")

Section 1: Background Information

Description: Before we ask about your specific methods, tools, and data management practices, we have some general questions about you, your lab or research group, and your area of research. The information you provide in this section will help us contextualize your other survey responses.

Back to table of contents


1. What is your current professional title or role?

In [3]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_title = pd.DataFrame({"Number": df["title"].value_counts(),
                         "Percentage": df["title"].value_counts(normalize=True)*100})

#Display text entered by participants who selected "other".
df_title
Out[3]:
Number Percentage
Assistant Professor (US), Lecturer (UK), or equivalent 69 25.274725
Post-Doctoral Fellow 60 21.978022
Graduate Student 57 20.879121
Associate Professor (US), Senior Lecturer (UK), or equivalent 30 10.989011
Professor (US), Reader (UK), or equivalent 27 9.890110
Research Associate/Scientist 14 5.128205
Adjunct Professor or other non-tenure-track faculty 5 1.831502
Research Assistant (including undergraduate RA) 4 1.465201
Other (Please specify) 4 1.465201
Project Coordinator or Lab Manager 3 1.098901
In [4]:
#Display text entered by participants who selected "other".

df["title_other_text"].value_counts()
Out[4]:
Research Dean        1
Senior Researcher    1
Professor (UK)       1
Name: title_other_text, dtype: int64

2. For approximately how many years have you been actively doing psychology research

In [5]:
#Participants could only give one response to this question.

#Calculate overall descriptive statistics.

print("median   ", str(df["years"].median()))
print(df["years"].describe())
median    12.0
count    270.000000
mean      13.851852
std        8.756532
min        1.000000
25%        8.000000
50%       12.000000
75%       18.000000
max       55.000000
Name: years, dtype: float64
In [6]:
#Calculate group statistics grouped by professional title.

df_title_years = df.groupby("title")["years"].describe()
df_title_years["median"] = df.groupby("title")["years"].median()

df_title_years
Out[6]:
count mean std min 25% 50% 75% max median
title
Adjunct Professor or other non-tenure-track faculty 5.0 11.800000 5.848077 8.0 8.0 10.0 11.00 22.0 10.0
Assistant Professor (US), Lecturer (UK), or equivalent 68.0 13.808824 3.486680 4.0 12.0 14.0 15.25 23.0 14.0
Associate Professor (US), Senior Lecturer (UK), or equivalent 29.0 19.517241 4.314599 11.0 17.0 20.0 22.00 31.0 20.0
Graduate Student 57.0 6.500000 2.119215 1.0 5.0 7.0 8.00 12.0 7.0
Other (Please specify) 4.0 34.500000 13.892444 14.0 32.0 40.0 42.50 44.0 40.0
Post-Doctoral Fellow 59.0 10.466102 3.412533 4.0 8.5 10.0 12.00 20.0 10.0
Professor (US), Reader (UK), or equivalent 27.0 29.333333 11.361203 10.0 23.0 25.0 35.00 55.0 25.0
Project Coordinator or Lab Manager 3.0 15.000000 9.848858 4.0 11.0 18.0 20.50 23.0 18.0
Research Assistant (including undergraduate RA) 3.0 3.333333 2.516611 1.0 2.0 3.0 4.50 6.0 3.0
Research Associate/Scientist 14.0 14.285714 7.559289 4.0 8.5 12.0 19.50 27.0 12.0

3. Which of the following best describes the institution or organization with which you are affiliated?

In [7]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_institution_type = pd.DataFrame({"Number": df["institution_type"].value_counts(),
                                    "Percentage": df["institution_type"].value_counts(normalize=True)*100})

df_institution_type
Out[7]:
Number Percentage
Predominantly research focused university or college 151 55.109489
Equally research/teaching focused university or college 71 25.912409
Medical school or health sciences center 20 7.299270
Predominantly teaching focused university or college 17 6.204380
Government agency or research center 8 2.919708
Non-profit organization 4 1.459854
Commercial organization 2 0.729927
Other (Please describe) 1 0.364964
In [8]:
#Display text entered by participants who selected "other".

df["institution_type_other_text"].value_counts()
Out[8]:
Research Institute    1
Name: institution_type_other_text, dtype: int64

4. In what country is your current institution or organization located?

In [9]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_country = pd.DataFrame({"Number": df["country"].value_counts(),
                           "Percentage": df["country"].value_counts(normalize=True)*100})
df["country"].value_counts()
Out[9]:
 United States of America                                152
 United Kingdom of Great Britain and Northern Ireland     29
 Canada                                                   17
 Germany                                                  15
 Australia                                                12
 Spain                                                     5
 Netherlands                                               4
 France                                                    4
 Italy                                                     4
 Sweden                                                    3
 Japan                                                     2
 Austria                                                   2
 Hungary                                                   2
 Switzerland                                               2
 Belgium                                                   2
 Poland                                                    2
 Israel                                                    1
 Chile                                                     1
 New Zealand                                               1
 Argentina                                                 1
 South Africa                                              1
 Colombia                                                  1
 Singapore                                                 1
 Portugal                                                  1
 Korea, Republic of                                        1
 Iceland                                                   1
 Slovakia                                                  1
 Bangladesh                                                1
 Indonesia                                                 1
 Lebanon                                                   1
 Brazil                                                    1
Name: country, dtype: int64

5. How big is your lab or research group?

In [10]:
#Participants gave multiple free responses to this question.

#Create dictionary containing responses.

lab_size_variables = {"lab_size_total": "Total lab size",
                      "lab_size_ra": "Research assistants",
                      "lab_size_gs": "Graduate students",
                      "lab_size_pd": "Postdocs",
                      "lab_size_fts": "Full time staff",
                      "lab_size_pts": "Part time staff"}

#Create dataframe containing descriptive statistics.

df_lab_size = df[list(lab_size_variables.keys())].describe().T
df_lab_size["median"] = df[list(lab_size_variables.keys())].median().T

df_lab_size.set_index([list(lab_size_variables.values())], inplace=True)

df_lab_size
Out[10]:
count mean std min 25% 50% 75% max median
Total lab size 274.0 14.164234 19.205998 0.0 6.0 10.0 16.00 180.0 10.0
Research assistants 274.0 6.952555 12.919642 0.0 2.0 4.0 9.75 180.0 4.0
Graduate students 274.0 2.945255 3.034707 0.0 1.0 3.0 4.00 30.0 3.0
Postdocs 274.0 1.248175 1.973380 0.0 0.0 1.0 2.00 12.0 1.0
Full time staff 274.0 2.419708 10.507756 0.0 0.0 1.0 2.00 160.0 1.0
Part time staff 274.0 0.598540 1.988833 0.0 0.0 0.0 0.00 20.0 0.0

6. Aside from your own, approximately how many labs or research groups are you currently collaborating with in a way that involves sharing data?

In [11]:
#Participants gave a single free text response to this question.

print("median    ", str(df["collaboration"].median()))
print(df["collaboration"].describe())
median     2.0
count    268.000000
mean       3.361940
std        7.632449
min        0.000000
25%        1.000000
50%        2.000000
75%        4.000000
max      100.000000
Name: collaboration, dtype: float64

7. What is your primary research area?

In [12]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_area = pd.DataFrame({"Number": df["research_area"].value_counts(),
                        "Percentage": df["research_area"].value_counts(normalize=True)*100})

df_area
Out[12]:
Number Percentage
Social and Personality Psychology 62 22.627737
Cognitive Psychology 53 19.343066
Developmental Psychology 44 16.058394
Cognitive Neuroscience 32 11.678832
Clinical Psychology 30 10.948905
Industrial/Organizational Psychology 18 6.569343
Other (Please describe) 13 4.744526
Biopsychology or Behavioral Neuroscience 11 4.014599
Educational Psychology 5 1.824818
Health Psychology 4 1.459854
Quantitative Psychology 2 0.729927
In [13]:
#Display text entered by participants who selected "other".

df["research_area_text"].value_counts()
Out[13]:
Environmental psychology                1
Correctional psychology                 1
Psychophysics                           1
Psycholinguistics                       1
Behavioral Genetics                     1
Evolutionary psychology                 1
affective neuroscience                  1
Social Neuroscience                     1
Behavioral economics                    1
Assessment and psychological methods    1
Pediatric Psychology                    1
Affective psychology                    1
developmental neuroscience              1
Name: research_area_text, dtype: int64

8. Which of the following best describes the data you collect as part of your research?

In [14]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_data_type = pd.DataFrame({"Number": df["data_type"].value_counts(),
                             "Percentage": df["data_type"].value_counts(normalize=True)*100})

df_data_type
Out[14]:
Number Percentage
Primarily quantitative 236 86.131387
A mix of quantitative and qualitative 33 12.043796
Primarily qualitative 5 1.824818

9. Who currently funds your research or work?

In [15]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

funder_variables = {"funder_nih": "National Institutes of Health",
                    "funder_nsf": "National Science Foundation",
                    "funder_government": "Other government funding",
                    "funder_private": "Private foundation",
                    "funder_professional": "Professional organization/society",
                    "funder_commercial": "Commercial organization",
                    "funder_internal": "Internal grants (including startup)",
                    "funder_other": "Other",
                    "funder_none": "I do not have funding for my work"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_funder = pd.DataFrame({"Number": df[list(funder_variables.keys())].count(),
                          "Percentage": df[list(funder_variables.keys())].count()/df["funder"].sum()*100})

df_funder.set_index([list(funder_variables.values())], inplace=True)
df_funder                    
Out[15]:
Number Percentage
National Institutes of Health 66 24.087591
National Science Foundation 48 17.518248
Other government funding 91 33.211679
Private foundation 57 20.802920
Professional organization/society 8 2.919708
Commercial organization 4 1.459854
Internal grants (including startup) 128 46.715328
Other 24 8.759124
I do not have funding for my work 31 11.313869
In [16]:
#Display text entered by participants who selected "other".

df["funder_other_text"].value_counts()
Out[16]:
University                                                                                             1
Lol                                                                                                    1
UK funder                                                                                              1
Binational grants                                                                                      1
None                                                                                                   1
ANR                                                                                                    1
ERC                                                                                                    1
German research foundation (DFG)                                                                       1
sshrc                                                                                                  1
European Union                                                                                         1
Medical Center                                                                                         1
Self-funded                                                                                            1
National Health and Medical Research Council, Australian Rotary Health, philanthropic organisations    1
Canadian govt                                                                                          1
Private donors                                                                                         1
International organizations (multi-country)                                                            1
non-profits                                                                                            1
I don’t know                                                                                         1
CDC                                                                                                    1
Private Founation                                                                                      1
Gifts from families                                                                                    1
(Swiss National Science Foundation, unsure if only the American is meant with the choice)              1
European Research Council                                                                              1
Name: funder_other_text, dtype: int64

10. What is your role on these grants?

In [17]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

funder_role_variables = {"funder_role_pi": "Primary Investigator",
                         "funder_role_ci": "Co-Investigator",
                         "funder_role_associate": "Faculty associate",
                         "funder_role_consultant": "Consultant",
                         "funder_role_pd": "Postdoc",
                         "funder_role_gs": "Graduate Student",
                         "funder_role_ug": "Undergraduate Student", 
                         "funder_role_other": "Other",
                         "funder_role_na": "Not Applicable"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_funder_role = pd.DataFrame({"Number": df[list(funder_role_variables.keys())].count(),
                               "Percentage": df[list(funder_role_variables.keys())].count()/df["funder"].sum()*100})

df_funder_role.set_index([list(funder_role_variables.values())], inplace=True)
df_funder_role
Out[17]:
Number Percentage
Primary Investigator 134 48.905109
Co-Investigator 94 34.306569
Faculty associate 5 1.824818
Consultant 15 5.474453
Postdoc 50 18.248175
Graduate Student 45 16.423358
Undergraduate Student 3 1.094891
Other 9 3.284672
Not Applicable 21 7.664234
In [18]:
#Display text entered by participants who selected "other".

df["funder_role_other_text"].value_counts()
Out[18]:
key personnel                                                                     1
Project Coordinator                                                               1
research project manager                                                          1
key personell                                                                     1
Grad Student but it is my own grant (grant towards me as a person so to speak)    1
Research Associate                                                                1
Former post-doc -- still ongoing                                                  1
data manager and statistical consultant                                           1
Statistician                                                                      1
Name: funder_role_other_text, dtype: int64

11. Have you ever written a Data Management Plan (DMP) as part of a grant or project proposal?

In [19]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_dmp = pd.DataFrame({"Number": df["dmp"].value_counts(),
                       "Percentage": df["dmp"].value_counts(normalize=True)*100})

df_dmp
Out[19]:
Number Percentage
No 123 45.054945
Yes, though I generally do not revisit my DMPs after I submit a grant or project proposal. 98 35.897436
Yes, and I generally revist my DMPs throughout the course of a project. 40 14.652015
I dont know. 12 4.395604

In [20]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

education_methods_variables = {"education_methods_courses_school": "Workshops or courses during undergraduate/graduate education",
                               "education_methods_courses_other": "Workshops or courses not associated with undergraduate/graduate education",
                               "education_methods_best_practices_psych": "Guidance or best practices created by organizations/experts in psychology",
                               "education_methods_best_practices_other": "Guidance or best practices created by organizations/experts outside psychology",
                               "education_methods_person_collab": "From researchers who are in/collaborate with my research group",
                               "education_methods_person_other": "From researchers who are not in/do not collaborate with my research group",
                               "education_methods_social_media": "Through social media",
                               "education_methods_self": "Self education",
                               "education_methods_none": "I have recieved no training",
                               "education_methods_other": "Other"}
                               
#Create dataframe containing both the number and percentage of responding participants who selected each response.
    
df_education_methods = pd.DataFrame({"Number": df[list(education_methods_variables.keys())].count(),
                                     "Percentage": df[list(education_methods_variables.keys())].count()/df["education_method"].sum()*100})

df_education_methods.set_index([list(education_methods_variables.values())], inplace=True)
df_education_methods               
Out[20]:
Number Percentage
Workshops or courses during undergraduate/graduate education 238 87.179487
Workshops or courses not associated with undergraduate/graduate education 134 49.084249
Guidance or best practices created by organizations/experts in psychology 191 69.963370
Guidance or best practices created by organizations/experts outside psychology 68 24.908425
From researchers who are in/collaborate with my research group 228 83.516484
From researchers who are not in/do not collaborate with my research group 122 44.688645
Through social media 111 40.659341
Self education 203 74.358974
I have recieved no training 0 0.000000
Other 6 2.197802
In [21]:
#Display text entered by participants who selected "other".

df["education_methods_other_text"].value_counts()
Out[21]:
statistical workshops                                                                                 1
OBSSR training programs                                                                               1
I teach the undergraduate methods course, so textbooks are a source, but not quite self-education.    1
I create some myself                                                                                  1
peer review                                                                                           1
Class                                                                                                 1
Name: education_methods_other_text, dtype: int64

In [22]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

education_management_variables = {"education_management_courses_school": "Workshops or courses during undergraduate/graduate education",
                                  "education_management_courses_other": "Workshops or courses not associated with undergraduate/graduate education",
                                  "education_management_best_practices_psych": "Guidance or best practices created by organizations/experts in psychology",
                                  "education_management_best_practices_other": "Guidance or best practices created by organizations/experts outside psychology",
                                  "education_management_person_collab": "From researchers who are in/collaborate with my research group",
                                  "education_management_person_other": "From researchers who are not in/do not collaborate with my research group",
                                  "education_management_social_media": "Through social media",
                                  "education_management_self": "Self education",
                                  "education_management_none": "I have recieved no training",
                                  "education_management_other": "Other"}
                               
#Create dataframe containing both the number and percentage of responding participants who selected each response.
    
df_education_management = pd.DataFrame({"Number": df[list(education_management_variables.keys())].count(),
                                     "Percentage": df[list(education_management_variables.keys())].count()/df["education_method"].sum()*100})

df_education_management.set_index([list(education_management_variables.values())], inplace=True)
df_education_management 
Out[22]:
Number Percentage
Workshops or courses during undergraduate/graduate education 91 33.333333
Workshops or courses not associated with undergraduate/graduate education 64 23.443223
Guidance or best practices created by organizations/experts in psychology 95 34.798535
Guidance or best practices created by organizations/experts outside psychology 51 18.681319
From researchers who are in/collaborate with my research group 175 64.102564
From researchers who are not in/do not collaborate with my research group 84 30.769231
Through social media 87 31.868132
Self education 142 52.014652
I have recieved no training 19 6.959707
Other 8 2.930403
In [23]:
#Display text entered by participants who selected "other".

df["education_management_other_text"].value_counts()
Out[23]:
A co-author recommended Scott Long's book on data management with STATA. (Not a psychology book)                                    1
formal training sponsored by our university's continuing education staff; also from university admin who sit on ethics committee    1
I create them myself                                                                                                                1
From reviewers of a data paper I published                                                                                          1
workshops                                                                                                                           1
IRB videos and training courses                                                                                                     1
Note: Some of my education has occurred from observing poor practices                                                               1
From the "culture" and typical practices of labs I've been involved in                                                              1
Name: education_management_other_text, dtype: int64

14. Does your current organization or institution provide any of the following?

In [24]:
#This question contained multiple parts, participants gave one answer to each.

df_education_methods = pd.DataFrame({"Data Management": df["institution_resource_rdm"].value_counts(normalize=True)*100,
                                     "Data Sharing": df["institution_resource_sharing"].value_counts(normalize=True)*100,
                                     "Insfrastructure (IT)": df["institution_resource_it"].value_counts(normalize=True)*100})

df_education_methods
Out[24]:
Data Management Data Sharing Insfrastructure (IT)
Im not sure if my institution offers these services 39.926740 41.025641 27.472527
No 24.908425 30.036630 18.315018
Yes, and I have taken advantage of it. 17.582418 10.989011 36.263736
Yes, but I have not taken advantage of it. 17.582418 17.948718 17.948718

15. On a scale of 1 (Not limited) to 5 (Very limited), how much are your current data management practices limited by each of the following:

In [25]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

limits_variables = {"limits_time": "The amount of time it takes",
                    "limits_cost": "The financial cost",
                    "limits_norms": "Lack of norms or best practices",
                    "limits_training": "Lack of training",
                    "limits_incentives": "Lack of professional incentives",
                    "limits_support": "Lack of institutional support",
                    "limits_guidance": "Lack of guidance from my PI/collaborators",
                    "limits_pi": "My supervisor requires I manage my data in a certain way",
                    "limits_data": "The charactoristics of my data limit what I can do",
                    "limits_knowldge": "I am unaware of best practices"} 

#Create dataframe containing descriptive statistics. 

#They're interesting and informative, but remember responses are ordinal.

df_limits_cont= df[list(limits_variables.keys())].describe().T
df_limits_cont["median"]= df[list(limits_variables.keys())].median().T
df_limits_cont.set_index([list(limits_variables.values())], inplace=True)
df_limits_cont
Out[25]:
count mean std min 25% 50% 75% max median
The amount of time it takes 268.0 3.041045 1.272805 1.0 2.00 3.0 4.0 5.0 3.0
The financial cost 267.0 2.142322 1.202428 1.0 1.00 2.0 3.0 5.0 2.0
Lack of norms or best practices 266.0 2.815789 1.252881 1.0 2.00 3.0 4.0 5.0 3.0
Lack of training 267.0 2.962547 1.237912 1.0 2.00 3.0 4.0 5.0 3.0
Lack of professional incentives 265.0 2.969811 1.353669 1.0 2.00 3.0 4.0 5.0 3.0
Lack of institutional support 264.0 2.723485 1.371650 1.0 1.75 3.0 4.0 5.0 3.0
Lack of guidance from my PI/collaborators 266.0 2.383459 1.269210 1.0 1.00 2.0 3.0 5.0 2.0
My supervisor requires I manage my data in a certain way 263.0 1.802281 1.065988 1.0 1.00 1.0 2.0 5.0 1.0
The charactoristics of my data limit what I can do 265.0 2.339623 1.367139 1.0 1.00 2.0 4.0 5.0 2.0
I am unaware of best practices 265.0 2.471698 1.237112 1.0 1.00 2.0 3.0 5.0 2.0
In [26]:
# Create dataframe displaying the percentage of participants who entered each value.

df_limits_cat = df[list(limits_variables.keys())]
df_limits_cat = df_limits_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_limits_cat = df_limits_cat.T
df_limits_cat.set_index([list(limits_variables.values())], inplace=True)
df_limits_cat                   
Out[26]:
1.0 2.0 3.0 4.0 5.0
The amount of time it takes 16.417910 15.671642 29.104478 25.000000 13.805970
The financial cost 40.823970 24.719101 18.352060 11.610487 4.494382
Lack of norms or best practices 19.924812 20.676692 25.939850 24.812030 8.646617
Lack of training 14.981273 22.097378 25.842697 25.842697 11.235955
Lack of professional incentives 18.490566 21.886792 19.245283 24.905660 15.471698
Lack of institutional support 25.000000 22.727273 20.833333 17.803030 13.636364
Lack of guidance from my PI/collaborators 33.834586 22.556391 21.052632 16.541353 6.015038
My supervisor requires I manage my data in a certain way 54.372624 22.433460 14.448669 6.083650 2.661597
The charactoristics of my data limit what I can do 40.377358 19.245283 13.962264 18.867925 7.547170
I am unaware of best practices 29.056604 23.396226 24.905660 16.603774 6.037736

16. On a scale of 1 (Not motivated) to 5 (Very motivated), how much are your current data management practices motivated by each of the following:

In [27]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

motivations_variables = {"motivations_loss": "Prevent loss of data",
                         "motivations_continuity": "Ensure continuity as research team changes",
                         "motivations_compliance_funding": "Compliance with mandates from funder/publisher",
                         "motivations_compliance_ethics": "Compliance with legal/ethical frameworks",
                         "motivations_guidance": "Availability of guidance of best practices",
                         "motivations_training": "Availability of training",
                         "motivations_best_practice": "Awareness of best practices",
                         "motivations_support": "Institutional support",
                         "motivations_pi": "Guidance from PI/Collaborators",
                         "motivations_transparency": "Desire to foster research transparency",
                         "motivations_reproducibility": "Desire to foster reproducibility"} 
In [28]:
#Create dataframe containing descriptive statistics. 

#They're interesting and informative, but remember responses are ordinal.

df_motivations_cont= df[list(motivations_variables.keys())].describe().T
df_motivations_cont["median"]= df[list(motivations_variables.keys())].median().T
df_motivations_cont["mode"]= df[list(motivations_variables.keys())].mode().T

df_motivations_cont.set_index([list(motivations_variables.values())], inplace=True)
df_motivations_cont
Out[28]:
count mean std min 25% 50% 75% max median mode
Prevent loss of data 261.0 4.509579 0.844043 1.0 4.0 5.0 5.0 5.0 5.0 5.0
Ensure continuity as research team changes 262.0 3.973282 1.149401 1.0 3.0 4.0 5.0 5.0 4.0 5.0
Compliance with mandates from funder/publisher 261.0 3.218391 1.436545 1.0 2.0 3.0 5.0 5.0 3.0 5.0
Compliance with legal/ethical frameworks 261.0 3.731801 1.320294 1.0 3.0 4.0 5.0 5.0 4.0 5.0
Availability of guidance of best practices 260.0 2.869231 1.213747 1.0 2.0 3.0 4.0 5.0 3.0 3.0
Availability of training 262.0 2.549618 1.207979 1.0 2.0 3.0 3.0 5.0 3.0 3.0
Awareness of best practices 261.0 3.298851 1.174553 1.0 2.0 3.0 4.0 5.0 3.0 3.0
Institutional support 261.0 2.241379 1.255439 1.0 1.0 2.0 3.0 5.0 2.0 1.0
Guidance from PI/Collaborators 259.0 2.806950 1.297504 1.0 2.0 3.0 4.0 5.0 3.0 3.0
Desire to foster research transparency 261.0 4.172414 1.058376 1.0 4.0 5.0 5.0 5.0 5.0 5.0
Desire to foster reproducibility 262.0 4.236641 1.008258 1.0 4.0 5.0 5.0 5.0 5.0 5.0
In [29]:
# Create dataframe displaying the percentage of participants who entered each value.

df_motivations_cat = df[list(motivations_variables.keys())]
df_motivations_cat = df_motivations_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_motivations_cat = df_motivations_cat.T
df_motivations_cat.set_index([list(motivations_variables.values())], inplace=True)
df_motivations_cat
Out[29]:
1.0 2.0 3.0 4.0 5.0
Prevent loss of data 1.532567 1.532567 9.195402 19.923372 67.816092
Ensure continuity as research team changes 5.343511 6.106870 16.793893 29.389313 42.366412
Compliance with mandates from funder/publisher 18.007663 14.559387 20.689655 21.072797 25.670498
Compliance with legal/ethical frameworks 9.961686 8.812261 17.624521 25.287356 38.314176
Availability of guidance of best practices 16.153846 21.538462 32.307692 19.230769 10.769231
Availability of training 24.809160 23.282443 31.679389 12.595420 7.633588
Awareness of best practices 6.896552 18.390805 31.800766 23.754789 19.157088
Institutional support 36.015326 29.118774 17.624521 9.195402 8.045977
Guidance from PI/Collaborators 22.007722 19.305019 24.710425 23.938224 10.038610
Desire to foster research transparency 3.065134 4.214559 17.624521 22.605364 52.490421
Desire to foster reproducibility 1.526718 4.961832 17.557252 20.229008 55.725191

Data Collection

Description: The questions in this section concern activities and practices beginning with the collection of raw data from human participants and ending before data are processed and/or analyzed.

Back to table of contents


17. On a scale of 1-5, how would you rate the overall maturity of your data management practices during the data collection phase of a project?

In [30]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["collect_mature_self"].median()))
print(df["collect_mature_self"].describe())
print(df["collect_mature_self"].value_counts())
median    3.0
count    236.000000
mean       3.288136
std        0.909540
min        1.000000
25%        3.000000
50%        3.000000
75%        4.000000
max        5.000000
Name: collect_mature_self, dtype: float64
3.0    94
4.0    83
2.0    35
5.0    17
1.0     7
Name: collect_mature_self, dtype: int64

18. On a scale of 1-5, how would you rate the data management practices for the field of psychology as a whole during the data collection phase of a project?

In [31]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["collect_mature_field"].median()))
print(df["collect_mature_field"].describe())
print(df["collect_mature_field"].value_counts())
median    3.0
count    232.000000
mean       2.612069
std        0.942242
min        1.000000
25%        2.000000
50%        3.000000
75%        3.000000
max        5.000000
Name: collect_mature_field, dtype: float64
3.0    91
2.0    80
4.0    28
1.0    26
5.0     7
Name: collect_mature_field, dtype: int64

19. On a scale of 1-5, how would you rate your willingness to change your data management practices during the data collection phase of a project in response to new technologies, opportunities, or requirements?

In [32]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["collect_change"].median()))
print(df["collect_change"].describe())
print(df["collect_change"].value_counts())
median    4.0
count    236.000000
mean       4.105932
std        0.904720
min        1.000000
25%        4.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: collect_change, dtype: float64
4.0    106
5.0     87
3.0     28
2.0     11
1.0      4
Name: collect_change, dtype: int64

20. Which of the following describes how you collect data from human participants?

In [33]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_how_variables = {"collect_how_lab": "Participants come to my lab to participate in an experiment",
                         "collect_how_travel": "I travel to my participants to collect data",
                         "collect_how_send": "I send my participants materials, which they return to me",
                         "collect_how_internet": "I collect data via the internet",
                         "collect_how_prompt": "I prompt participants to enter data",
                         "collect_how_secondary": "I examine records or data collected by others.",
                         "collect_how_other": "Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_collect_how = pd.DataFrame({"Number": df[list(collect_how_variables.keys())].count(),
                               "Percentage": df[list(collect_how_variables.keys())].count()/df["collect_how"].sum()*100})

df_collect_how.set_index([list(collect_how_variables.values())], inplace=True)
df_collect_how 
Out[33]:
Number Percentage
Participants come to my lab to participate in an experiment 189 79.411765
I travel to my participants to collect data 86 36.134454
I send my participants materials, which they return to me 78 32.773109
I collect data via the internet 150 63.025210
I prompt participants to enter data 32 13.445378
I examine records or data collected by others. 104 43.697479
Other 2 0.840336
In [34]:
#Display text entered by participants who selected "other".

df["collect_how_other_text"].value_counts()
Out[34]:
Qualtrics, videoed interviews          1
Focus groups /interviews (in-house)    1
Name: collect_how_other_text, dtype: int64

21. Do the participants in your research include members of a “vulnerable” or “special” population (e.g. children, prisoners, clinical populations)?

In [35]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_data_type = pd.DataFrame({"Number": df["collect_vulnerable"].value_counts(),
                             "Percentage": df["collect_vulnerable"].value_counts(normalize=True)*100})
df_data_type
Out[35]:
Number Percentage
No 125 52.742616
Yes 112 47.257384

22. What software tools do you use to build experiments, ask questions, or collect data from your participants?

In [36]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_software_variables = {"collect_software_eprime":"E-Prime",
                              "collect_software_builder":"Experiment Builder",
                              "collect_software_inquisit": "Inquisit",
                              "collect_software_presentation":"presentation",
                              "collect_software_psychopy":"psychopy",
                              "collect_software_matlab": "Matlab",
                              "collect_software_redcap": "REDcap",
                              "collect_software_qualtrics": "Qualtrics",
                              "collect_software_custom": "Custom code",
                              "collect_software_other": "Other",
                              "collect_software_none": "I don't use software for this purpose"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_collect_software = pd.DataFrame({"Number": df[list(collect_software_variables.keys())].count(),
                                    "Percentage": df[list(collect_software_variables.keys())].count()/df["collect_software"].sum()*100})

df_collect_software.set_index([list(collect_software_variables.values())], inplace=True)
df_collect_software 
Out[36]:
Number Percentage
E-Prime 52 22.127660
Experiment Builder 6 2.553191
Inquisit 22 9.361702
presentation 13 5.531915
psychopy 46 19.574468
Matlab 62 26.382979
REDcap 34 14.468085
Qualtrics 135 57.446809
Custom code 52 22.127660
Other 52 22.127660
I don't use software for this purpose 10 4.255319
In [37]:
#Display text entered by participants who selected "other".

df["collect_software_other_text"].value_counts()
Out[37]:
eye tracking specific software, software for collecting looking from infants (Habit)                                        1
COINS                                                                                                                       1
PyHab                                                                                                                       1
Survey Monkey, professional survey organizations                                                                            1
Tobii                                                                                                                       1
hospital records                                                                                                            1
JSPsych                                                                                                                     1
Superlab                                                                                                                    1
Twilio                                                                                                                      1
videos                                                                                                                      1
Custom tablet applications                                                                                                  1
LiveCode                                                                                                                    1
video/audio                                                                                                                 1
Online: bespoke software; children: direct interaction with experimenter                                                    1
Paper/Pencil tests (e.g. Trail Making Task))                                                                                1
Free survey platform (LimeSurvey & Soscisurvey)                                                                             1
brain imaging                                                                                                               1
Jisc online survey                                                                                                          1
Adobe LiveCycle Forms                                                                                                       1
Socratos, LimeSurvey                                                                                                        1
Visual Basic                                                                                                                1
formr; zTree; SoPHIE                                                                                                        1
UK website                                                                                                                  1
We program our experiments using programing languages like Python and others                                                1
CheckBox                                                                                                                    1
JsPsych, Adobe LiveCycle forms                                                                                              1
Soscisurvey (online questionnaire platform based in germany)                                                                1
Opinio                                                                                                                      1
.                                                                                                                           1
Roqua                                                                                                                       1
SurveyMonkey                                                                                                                1
EMA software vendors                                                                                                        1
jspsych                                                                                                                     1
Gorilla Experiment Builder (gorilla.sc)                                                                                     1
Gorilla                                                                                                                     1
Medialab/DirectRT                                                                                                           1
Unipark (similar to qualtrics)                                                                                              1
Surveymonkey                                                                                                                1
Superlab                                                                                                                    1
Proprietary software developed by employer                                                                                  1
Real Basic                                                                                                                  1
Opensesame                                                                                                                  1
custom software                                                                                                             1
Question Pro                                                                                                                1
LIMESURVEY                                                                                                                  1
Opensesame                                                                                                                  1
Tatool Web                                                                                                                  1
Superlab,tobii                                                                                                              1
Trivox                                                                                                                      1
A few of our measures are scored using computers.  All data are hand-entered, though, whether scored by computer or not.    1
SoSci Survey                                                                                                                1
Real basic, now called Xojo                                                                                                 1
Name: collect_software_other_text, dtype: int64

23. What type of data do you collect from your participants? (Select all that apply)

In [38]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_data_variables = {"collect_data_av":"Audio/visual recordings",
                          "collect_data_demographics":"Demographic data",
                          "collect_data_clinical":"Clinical or Medical data",
                          "collect_data_scales_quantitative":"Quantitative data from questionnaires",
                          "collect_data_scales_qualitative":"Qualitative data from questionnaires",
                          "collect_data_behavioral":"Behavioral data",
                          "collect_data_qualitative":"Qualitative data",
                          "collect_data_neurpsych":"Neuropsychological or aptitude tests",
                          "collect_data_neuroimaging":"Neuroimaging data",
                          "collect_data_writing":"Data from written documents",
                          "collect_data_physiology":"Physiological data",
                          "collect_data_genetic": "Genetic/molecular data",
                          "collect_data_eye_tracking":"Eye tracking/pupillometry data",
                          "collect_data_other":"Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_collect_data = pd.DataFrame({"Number": df[list(collect_data_variables.keys())].count(),
                               "Percentage": df[list(collect_data_variables.keys())].count()/df["collect_data"].sum()*100})

df_collect_data.set_index([list(collect_data_variables.values())], inplace=True)
df_collect_data
Out[38]:
Number Percentage
Audio/visual recordings 98 41.176471
Demographic data 231 97.058824
Clinical or Medical data 79 33.193277
Quantitative data from questionnaires 208 87.394958
Qualitative data from questionnaires 100 42.016807
Behavioral data 190 79.831933
Qualitative data 45 18.907563
Neuropsychological or aptitude tests 51 21.428571
Neuroimaging data 72 30.252101
Data from written documents 26 10.924370
Physiological data 59 24.789916
Genetic/molecular data 25 10.504202
Eye tracking/pupillometry data 59 24.789916
Other 11 4.621849
In [39]:
#Display text entered by participants who selected "other".

df["collect_data_other_text"].value_counts()
Out[39]:
Motion capture data                           1
Wide range of observable outcome variables    1
Interviews with significant others            1
saliva for hormone assays                     1
criminal justice records                      1
tdcs                                          1
wearable devices (e.g., Fitbit)               1
administrative health records                 1
institutional data (e.g., student grades)     1
Cerebrospinal fluid                           1
Academic transcript/government records        1
Name: collect_data_other_text, dtype: int64

24. What additional information do you need to preserve for a research project that is connected to the data collected from your participants?

In [40]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_information_variables={"collect_information_session":"Information about the data collection session",
                               "collect_information_paradigm":"Research protocol/paradigm-related information",
                               "collect_information_stimuli":"Research-related stimuli",
                               "collect_information_text":"Text of questionnaires, scales,etc",
                               "collect_information_scripts":"Computer code used for data collection",
                               "collect_information_coding":"Coding materials",
                               "collect_information_consent":"Informed consent-related documentation",
                               "collect_information_other":"Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

collect_information_data = pd.DataFrame({"Number": df[list(collect_information_variables.keys())].count(),
                                         "Percentage": df[list(collect_information_variables.keys())].count()/df["collect_information"].sum()*100})

collect_information_data.set_index([list(collect_information_variables.values())], inplace=True)
collect_information_data
Out[40]:
Number Percentage
Information about the data collection session 178 76.068376
Research protocol/paradigm-related information 151 64.529915
Research-related stimuli 162 69.230769
Text of questionnaires, scales,etc 192 82.051282
Computer code used for data collection 150 64.102564
Coding materials 166 70.940171
Informed consent-related documentation 205 87.606838
Other 4 1.709402
In [41]:
#Display text entered by participants who selected "other".

df["collect_information_other_text"].value_counts()
Out[41]:
Case summaries, medical records              1
Questionnaire modification and versioning    1
Payment information (if applicable)          1
Scoring keys for the quesionnaires           1
Name: collect_information_other_text, dtype: int64

In [42]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_code_variables={"collect_code_ir":"Institutional repository",
                        "collect_code_github":"Software-specific hosting service",
                        "collect_code_osf":"Open Science Framework (OSF)",
                        "collect_code_repo":"General purpose repository ",
                        "collect_code_article":"Journal article ",
                        "collect_code_website":"Website",
                        "collect_code_other":"Other",
                        "collect_code_none":"None"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_collect_code = pd.DataFrame({"Number": df[list(collect_code_variables.keys())].count(),
                               "Percentage": df[list(collect_code_variables.keys())].count()/df["collect_information"].sum()*100})

df_collect_code.set_index([list(collect_code_variables.values())], inplace=True)
df_collect_code
Out[42]:
Number Percentage
Institutional repository 24 10.256410
Software-specific hosting service 47 20.085470
Open Science Framework (OSF) 99 42.307692
General purpose repository 7 2.991453
Journal article 45 19.230769
Website 39 16.666667
Other 12 5.128205
None 85 36.324786
In [43]:
#Display text entered by participants who selected "other".

df["collect_code_other_text"].value_counts()
Out[43]:
Using another institutional repository                                                                                   1
Australian government data archive                                                                                       1
shared stimuli-presentation and data collection code when requested for others to create similar protocols; via email    1
Tatool Web                                                                                                               1
will share syntax by email                                                                                               1
In multiple direct email responses                                                                                       1
Emailed (i.e., copy and paste into body of email)to collaborators                                                        1
on request                                                                                                               1
Dropbox                                                                                                                  1
In codebooks that can be requested                                                                                       1
Upon request                                                                                                             1
via email to specific requesters                                                                                         1
Name: collect_code_other_text, dtype: int64

26. Which of the following describes how you store and analyze your data?

In [44]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_storage_variables={"collect_storage_own_both":"I use my own machine(s) to store and analyze my data",
                           "collect_storage_own_analysis":"I use my own machine(s) to analyze my data, but I store my data on a shared drive.",
                           "collect_storage_workstation":"I use a workstation that I share with other researchers to analyze and store my data.",
                           "collect_storage_server":"I log in to my lab’s shared server or cluster to analyze and store my data.",
                           "collect_storage_none":"I do not analyze or store my data electronically.",
                           "collect_storage_other":"Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_collect_storage = pd.DataFrame({"Number": df[list(collect_storage_variables.keys())].count(),
                                   "Percentage": df[list(collect_storage_variables.keys())].count()/df["collect_storage"].sum()*100})

df_collect_storage.set_index([list(collect_storage_variables.values())], inplace=True)
df_collect_storage
Out[44]:
Number Percentage
I use my own machine(s) to store and analyze my data 154 64.978903
I use my own machine(s) to analyze my data, but I store my data on a shared drive. 174 73.417722
I use a workstation that I share with other researchers to analyze and store my data. 37 15.611814
I log in to my lab’s shared server or cluster to analyze and store my data. 94 39.662447
I do not analyze or store my data electronically. 0 0.000000
Other 9 3.797468
In [45]:
#Display text entered by participants who selected "other".

df["collect_storage_other_text"].value_counts()
Out[45]:
also back up computers regularly with back-up hard drives     1
Research assistants analyze my data                           1
external backup harddrives                                    1
University encrypted PC and university filespace              1
I store data on an institutional repository                   1
Encrypted Cloud storage (Sync)                                1
Institutional server                                          1
University-supported tape back-up                             1
Institutional OneDrive                                        1
Name: collect_storage_other_text, dtype: int64

27. What system(s) do you use to keep your raw digital files/data organized?

In [46]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_organize_variables={"collect_organize_names":"Standardized file naming",
                            "collect_organize_structures":"Standardized file organization",
                            "collect_organize_notebook":"Lab notebook, data dictionary, codebook",
                            "collect_organize_general":"General procedures that aren't standardized or recorded",
                            "collect_organize_none":"No procedures",
                            "collect_organize_na":"Not applicable",
                            "collect_organize_other":"Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_collect_organize = pd.DataFrame({"Number": df[list(collect_organize_variables.keys())].count(),
                                    "Percentage": df[list(collect_organize_variables.keys())].count()/df["collect_organize"].sum()*100})

df_collect_organize.set_index([list(collect_organize_variables.values())], inplace=True)
df_collect_organize
Out[46]:
Number Percentage
Standardized file naming 126 53.164557
Standardized file organization 130 54.852321
Lab notebook, data dictionary, codebook 80 33.755274
General procedures that aren't standardized or recorded 90 37.974684
No procedures 24 10.126582
Not applicable 1 0.421941
Other 5 2.109705
In [47]:
#Display text entered by participants who selected "other".

df["collect_organize_other_text"].value_counts()
Out[47]:
I process all of my data in R (with some parts of the processing pipeline occurring in Excel). The script contains detailed notes on how to run it, also for the bits that are done in Excel. I use the same basic script for all of my experiments, which I tailor to the particular experiment (for example if a particular task wasn't included or I had two groups of participants whose data needed to be processed separately). These scripts are uploaded to the OSF along with the raw data so that others can scrutinise it (if they want) and/or reproduce my results.    1
documenting data structure when uploading to OSF                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    1
For some data types more than others                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1
rely on Qualtrics as back-up of raw data, named by date + protocol folder and associated with original protocol still                                                                                                                                                                                                                                                                                                                                                                                                                                                               1
All organization is done by code, which I keep. There is no further documentation though.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           1
Name: collect_organize_other_text, dtype: int64

28. Does everyone in your lab or research group use similar system(s) for organizing their raw digital files/data?

In [48]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_collect_organize_lab = pd.DataFrame({"Number": df["collect_organize_lab"].value_counts(),
                                        "Percentage": df["collect_organize_lab"].value_counts(normalize=True)*100})
df_collect_organize_lab
Out[48]:
Number Percentage
Yes 92 38.818565
No 82 34.599156
Im not sure. 53 22.362869
Not applicable 10 4.219409

29. How do you backup or secure your digital files/data?

In [49]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

collect_backup_variables={"collect_backup_digitize":"Digitizing non-digital files/data.",
                          "collect_backup_cabinet":"Storing non-digital data in a secure location",
                          "collect_backup_hard_drive":"External hard drive",
                          "collect_backup_backup_manual":"Manually backing up my local machine",
                          "collect_backup_backup_automatic":"Automatically backing up my local machine",
                          "collect_backup_server_lab":"Using a lab-owned server",
                          "collect_backup_server_department":"Local server (Department)",
                          "collect_backup_server_institution":"Local server (Institution)",
                          "collect_backup_cloud":"Upload to the cloud",
                          "collect_backup_ir":"Deposit it to my institutional repository",
                          "collect_backup_other":"Other",
                          "collect_backup_none":"I do not back up my files"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_backup_organize = pd.DataFrame({"Number": df[list(collect_backup_variables.keys())].count(),
                                   "Percentage": df[list(collect_backup_variables.keys())].count()/df["collect_backup"].sum()*100})

df_backup_organize.set_index([list(collect_backup_variables.values())], inplace=True)
df_backup_organize                        
Out[49]:
Number Percentage
Digitizing non-digital files/data. 77 32.352941
Storing non-digital data in a secure location 133 55.882353
External hard drive 89 37.394958
Manually backing up my local machine 82 34.453782
Automatically backing up my local machine 55 23.109244
Using a lab-owned server 52 21.848739
Local server (Department) 40 16.806723
Local server (Institution) 80 33.613445
Upload to the cloud 139 58.403361
Deposit it to my institutional repository 30 12.605042
Other 9 3.781513
I do not back up my files 1 0.420168

30. Does everyone in your lab or research group use similar system(s) for backing up their digital files/data?

In [50]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_collect_backup_lab = pd.DataFrame({"Number": df["collect_backup_lab"].value_counts(),
                                      "Percentage": df["collect_backup_lab"].value_counts(normalize=True)*100})
df_collect_backup_lab
Out[50]:
Number Percentage
Yes 122 51.476793
Im not sure. 69 29.113924
No 46 19.409283

31. How many backup copies do you keep of your digital files/data?

In [51]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_collect_backup_copy = pd.DataFrame({"Number": df["collect_backup_copy"].value_counts(),
                                       "Percentage": df["collect_backup_copy"].value_counts(normalize=True)*100})
df_collect_backup_copy
Out[51]:
Number Percentage
1 93 39.075630
2 84 35.294118
3 29 12.184874
More than 3 18 7.563025
I do not keep any backup copies of my digital files/data. 14 5.882353

32. On a scale of 1 (No need) to 5 (High level of need), please indicate your level of need for training or education for each of the following:

In [52]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

collect_education_variables={"collect_education_dmp":"Completing a data management plan (DMP)",
                             "collect_education_storing":"Best practices for storing and backing up data.",
                             "collect_education_security":"Ensuring the security of sensitive data ",
                             "collect_education_organizing":"Organizing data",
                             "collect_education_documenting":"Documenting data"}
In [53]:
#Create dataframe containing descriptive statistics. 

#They're interesting and informative, but remember responses are ordinal.

df_collect_education_cont= df[list(collect_education_variables.keys())].describe().T
df_collect_education_cont["median"]= df[list(collect_education_variables.keys())].median().T
df_collect_education_cont.set_index([list(collect_education_variables.values())], inplace=True)
df_collect_education_cont
Out[53]:
count mean std min 25% 50% 75% max median
Completing a data management plan (DMP) 238.0 3.000000 1.315279 1.0 2.0 3.0 4.0 5.0 3.0
Best practices for storing and backing up data. 238.0 3.155462 1.254944 1.0 2.0 3.0 4.0 5.0 3.0
Ensuring the security of sensitive data 238.0 2.894958 1.341285 1.0 2.0 3.0 4.0 5.0 3.0
Organizing data 238.0 3.016807 1.353122 1.0 2.0 3.0 4.0 5.0 3.0
Documenting data 238.0 3.105042 1.347562 1.0 2.0 3.0 4.0 5.0 3.0
In [54]:
# Create dataframe displaying the percentage of participants who entered each value.

df_collect_education_cont = df[list(collect_education_variables.keys())]
df_collect_education_cont = df_collect_education_cont.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_collect_education_cont = df_collect_education_cont.T
df_collect_education_cont.set_index([list(collect_education_variables.values())], inplace=True)
df_collect_education_cont
Out[54]:
1.0 2.0 3.0 4.0 5.0
Completing a data management plan (DMP) 18.487395 16.806723 24.789916 26.050420 13.865546
Best practices for storing and backing up data. 10.504202 22.268908 26.470588 22.689076 18.067227
Ensuring the security of sensitive data 18.067227 25.210084 21.848739 18.907563 15.966387
Organizing data 18.067227 19.747899 21.008403 24.789916 16.386555
Documenting data 15.126050 20.588235 22.689076 21.848739 19.747899
In [55]:
#Create a stacked bar chart to display the data.

with sns.color_palette("Greens"):
    bar_education_collect_stacked = df_collect_education_cont.plot(kind='barh', stacked=True, legend=False)

#Clean up the formatting
    
bar_education_collect_stacked.invert_yaxis()
bar_education_collect_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
sns.despine(offset=10)

plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.axvline(25, color="k", linestyle="--");
plt.axvline(50, color="k", linestyle="--");
plt.axvline(75, color="k", linestyle="--");

Data Analysis

Description: The questions in this section concern activities and practices starting when data is processed, cleaned, or inspected, continuing through the application of descriptive and/or inferential statistics, and ending before the data is made available or described in a presentation or scholarly publication.

Back to table of contents


33. On a scale of 1-5, how would you rate the overall maturity of your data management practices during the data analysis phase of a project.

In [56]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["analyze_mature_self"].median()))
print(df["analyze_mature_self"].describe())
print(df["analyze_mature_self"].value_counts())
median    3.0
count    231.000000
mean       3.354978
std        0.957560
min        1.000000
25%        3.000000
50%        3.000000
75%        4.000000
max        5.000000
Name: analyze_mature_self, dtype: float64
3.0    93
4.0    75
2.0    29
5.0    26
1.0     8
Name: analyze_mature_self, dtype: int64

34. On a scale of 1-5, how would you rate the data management practices for the field of psychology as a whole during the data analysis phase of a project.

In [57]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median    ", str(df["analyze_mature_field"].median()))
print(df["analyze_mature_field"].describe())
print(df["analyze_mature_field"].value_counts())
median     3.0
count    225.000000
mean       2.684444
std        0.897903
min        1.000000
25%        2.000000
50%        3.000000
75%        3.000000
max        5.000000
Name: analyze_mature_field, dtype: float64
3.0    91
2.0    72
4.0    39
1.0    21
5.0     2
Name: analyze_mature_field, dtype: int64

35. On a scale of 1-5, how would you rate your willingness to change your data management practices during the data analysis phase in response to new technologies, opportunities, or requirements.

In [58]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["analyze_change"].median()))
print(df["analyze_change"].describe())
print(df["analyze_change"].value_counts())
median    4.0
count    231.000000
mean       4.194805
std        0.850150
min        1.000000
25%        4.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: analyze_change, dtype: float64
5.0    96
4.0    95
3.0    32
2.0     5
1.0     3
Name: analyze_change, dtype: int64

36. If you received formal training in statistics and/or data analysis, what software tools were you taught how to use?

In [59]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

analyze_tools_taught_variables={"analyze_tools_taught_excel":"Excel",
                                "analyze_tools_taught_jasp":"JASP",
                                "analyze_tools_taught_jamovi":"jamovi",
                                "analyze_tools_taught_lisrel":"Lisrel",
                                "analyze_tools_taught_literate":"Literate programming tools",
                                "analyze_tools_taught_matlab":"MATLAB",
                                "analyze_tools_taught_mplus":"MPlus",
                                "analyze_tools_taught_python":"python",
                                "analyze_tools_taught_sas":"SAS",
                                "analyze_tools_taught_spss":"SPSS",
                                "analyze_tools_taught_stata":"STATA",
                                "analyze_tools_taught_systat":"SYSTAT",
                                "analyze_tools_taught_r":"R",
                                "analyze_tools_taught_other":"Other",
                                "analyze_tools_taught_no_tools":"I was not taught to use software tools",
                                "analyze_tools_taught_no_training":"I have recieved no formal training"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_analyze_tools_taught = pd.DataFrame({"Number": df[list(analyze_tools_taught_variables.keys())].count(),
                               "Percentage": df[list(analyze_tools_taught_variables.keys())].count()/df["analyze_tools_taught"].sum()*100})

df_analyze_tools_taught.set_index([list(analyze_tools_taught_variables.values())], inplace=True)
df_analyze_tools_taught                                         
Out[59]:
Number Percentage
Excel 109 46.781116
JASP 20 8.583691
jamovi 2 0.858369
Lisrel 26 11.158798
Literate programming tools 4 1.716738
MATLAB 50 21.459227
MPlus 56 24.034335
python 18 7.725322
SAS 42 18.025751
SPSS 203 87.124464
STATA 19 8.154506
SYSTAT 5 2.145923
R 115 49.356223
Other 16 6.866953
I was not taught to use software tools 7 3.004292
I have recieved no formal training 6 2.575107
In [60]:
#Display text entered by participants who selected "other".

df["analyze_tools_taught_other_text"].value_counts()
Out[60]:
Statistica                                       2
JMP                                              2
OpenMx                                           1
Jmp                                              1
eqs                                              1
AFNI (Analysis of Functional NeuroImages         1
BMDP                                             1
Minitab                                          1
BMDP, 1970's                                     1
SPM, AFNI                                        1
HLM                                              1
HLM for windows, openBUGS, & CEFA                1
Fortran (see above, 42 years in the business)    1
Dept specific program (now defunct)              1
Name: analyze_tools_taught_other_text, dtype: int64

37. What software tools do you currently use to analyze your data?

In [61]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

analyze_tools_use_variables={"analyze_tools_use_excel":"Excel",
                             "analyze_tools_use_jasp":"JASP",
                             "analyze_tools_use_jamovi":"jamovi",
                             "analyze_tools_use_lisrel":"Lisrel",
                             "analyze_tools_use_literate":"Literate programming tools",
                             "analyze_tools_use_matlab":"MATLAB",
                             "analyze_tools_use_mplus":"MPlus",
                             "analyze_tools_use_python":"python",
                             "analyze_tools_use_sas":"SAS",
                             "analyze_tools_use_spss":"SPSS",
                             "analyze_tools_use_stata":"STATA",
                             "analyze_tools_use_systat":"SYSTAT",
                             "analyze_tools_use_r":"R",
                             "analyze_tools_use_other":"Other",
                             "analyze_tools_use_no_tools":"I was not taught to use software tools"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_analyze_tools_use = pd.DataFrame({"Number": df[list(analyze_tools_use_variables.keys())].count(),
                                     "Percentage": df[list(analyze_tools_use_variables.keys())].count()/df["analyze_tools_use"].sum()*100})

df_analyze_tools_use.set_index([list(analyze_tools_use_variables.values())], inplace=True)
df_analyze_tools_use                            
Out[61]:
Number Percentage
Excel 113 48.497854
JASP 31 13.304721
jamovi 12 5.150215
Lisrel 4 1.716738
Literate programming tools 4 1.716738
MATLAB 56 24.034335
MPlus 53 22.746781
python 25 10.729614
SAS 16 6.866953
SPSS 148 63.519313
STATA 15 6.437768
SYSTAT 1 0.429185
R 157 67.381974
Other 16 6.866953
I was not taught to use software tools 0 0.000000
In [62]:
#Display text entered by participants who selected "other".

df["analyze_tools_use_other_text"].value_counts()
Out[62]:
Statistica                                                         2
JMP                                                                2
Fortran                                                            1
Others analyze data on my behalf, using Stata and MPlus usually    1
NVIVO, LIWC                                                        1
AFNI (Analysis of Functional NeuroImages                           1
GraphPad                                                           1
self-written C++ & Pascal programs                                 1
Network Analysis                                                   1
OpenMx                                                             1
seldomly I use:  HLM, QROC, SAS, MPlus                             1
HLM                                                                1
Neuroimaging software (FSL, SPM, Freesurfer, etc)                  1
AFNI, SPM                                                          1
Name: analyze_tools_use_other_text, dtype: int64

38. Does everyone in your lab or research group use the same software tools to analyze their data?

In [63]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_analyze_tools_lab = pd.DataFrame({"Number": df["analyze_tools_lab"].value_counts(),
                                     "Percentage": df["analyze_tools_lab"].value_counts(normalize=True)*100})

df_analyze_tools_lab
Out[63]:
Number Percentage
Some tools are used by everyone, but sometimes different people or different projects use additional tools. 152 65.517241
No, there is no standardization in terms of what software tools my lab or research group members use. 48 20.689655
Yes, everyone uses the same software tool(s). 29 12.500000
Im not sure. 3 1.293103

39. Do you use the same version(s) of software or tools to analyze data throughout the complete duration of a specific project?

In [64]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_analyze_tools_vc = pd.DataFrame({"Number": df["analyze_tools_vc"].value_counts(),
                                    "Percentage": df["analyze_tools_vc"].value_counts(normalize=True)*100})

df_analyze_tools_vc
Out[64]:
Number Percentage
Sometimes, when possible 113 48.706897
No 67 28.879310
Yes, always 36 15.517241
Im not sure 16 6.896552

40. Do you create or adapt custom computer code or scripts in order to analyze data?

In [65]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_analyze_code = pd.DataFrame({"Number": df["analyze_code"].value_counts(),
                                "Percentage": df["analyze_code"].value_counts(normalize=True)*100})

df_analyze_code
Out[65]:
Number Percentage
Yes, I use both code I have created myself and adapt code written by others. 95 40.772532
Yes, I primarily create my own code. 93 39.914163
Yes, I primarily adapt code written by others. 25 10.729614
No 18 7.725322
Im not sure. 2 0.858369

In [66]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

analyze_code_shared_variables={"analyze_code_shared_ir":"Yes. In an institutional repository",
                               "analyze_code_shared_github":"Yes. Using a software specific repository",
                               "analyze_code_shared_osf":"Yes. Using the Open Science Framework",
                               "analyze_code_shared_repo":"Yes. Using a general purpose repository",
                               "analyze_code_shared_article":"Yes. As part of a journal article",
                               "analyze_code_shared_website":"Yes. On a lab or project website",
                               "analyze_code_shared_other":"Yes. Other.",
                               "analyze_code_shared_none":"No"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_analyze_code_shared = pd.DataFrame({"Number": df[list(analyze_code_shared_variables.keys())].count(),
                                       "Percentage": df[list(analyze_code_shared_variables.keys())].count()/df["analyze_code_shared"].sum()*100})

df_analyze_code_shared.set_index([list(analyze_code_shared_variables.values())], inplace=True)
df_analyze_code_shared                                  
Out[66]:
Number Percentage
Yes. In an institutional repository 17 7.296137
Yes. Using a software specific repository 46 19.742489
Yes. Using the Open Science Framework 96 41.201717
Yes. Using a general purpose repository 7 3.004292
Yes. As part of a journal article 38 16.309013
Yes. On a lab or project website 25 10.729614
Yes. Other. 12 5.150215
No 86 36.909871
In [67]:
#Display text entered by participants who selected "other".

df["analyze_code_shared_other_text"].value_counts()
Out[67]:
emailed to collaborators                                                                                       1
I've emailed SAS and SPSS syntax to other sites of multi-site studies.  I've also received syntax this way.    1
codeocean.com                                                                                                  1
on the cloud                                                                                                   1
Only within the lab                                                                                            1
when requested, informal                                                                                       1
email                                                                                                          1
via email (pasted into body of email)                                                                          1
Email                                                                                                          1
Email when requested                                                                                           1
Only on servers maintained by my lab/institution                                                               1
RPubs                                                                                                          1
Name: analyze_code_shared_other_text, dtype: int64

42. How do you typically document your activities during the data analysis phase of a project?

In [68]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

analyze_docs_variables={"analyze_docs_paper":"Physical notebook or on paper",
                        "analyze_docs_word":"word processing or note-taking program ",
                        "analyze_docs_project_management":"collaborative project management system ",
                        "analyze_docs_eln":"electronic lab notebook",
                        "analyze_docs_literate":"literate programming tools",
                        "analyze_docs_vc":"version control system",
                        "analyze_docs_wiki":"lab wiki",
                        "analyze_docs_readme":"ReadMe files",
                        "analyze_docs_none":"I do not document my activities in any systematic way.",
                        "analyze_docs_other":"Other"}

df_analyze_docs = pd.DataFrame({"Number": df[list(analyze_docs_variables.keys())].count(),
                                "Percentage": df[list(analyze_docs_variables.keys())].count()/df["analyze_docs"].sum()*100})

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_analyze_docs.set_index([list(analyze_docs_variables.values())], inplace=True)
df_analyze_docs
Out[68]:
Number Percentage
Physical notebook or on paper 54 23.275862
word processing or note-taking program 125 53.879310
collaborative project management system 44 18.965517
electronic lab notebook 5 2.155172
literate programming tools 53 22.844828
version control system 31 13.362069
lab wiki 4 1.724138
ReadMe files 44 18.965517
I do not document my activities in any systematic way. 36 15.517241
Other 35 15.086207
In [69]:
#Display text entered by participants who selected "other".

df["analyze_docs_other_text"].value_counts()
Out[69]:
I embed notes in analysis code files (eg MATLAB) and in manuscript preparation files (eg LaTex)                                                                                  1
comments in R code                                                                                                                                                               1
Code comments                                                                                                                                                                    1
I keep notes in the comments section of the code as I go                                                                                                                         1
I would use github but it's not allowed to be installed on the lab computers, additionally I always try to always use comments in my scripts which document my analysis steps    1
Comments within syntax                                                                                                                                                           1
Commenting code                                                                                                                                                                  1
follow standard lab checklist                                                                                                                                                    1
Excel spreadsheets                                                                                                                                                               1
I keep notes in the computer code and syntax.                                                                                                                                    1
Google Sheets                                                                                                                                                                    1
The note is a part of the statistical software syntax.                                                                                                                           1
I take notes on the syntax or input files.                                                                                                                                       1
I comment the R code                                                                                                                                                             1
I keep notes in the syntax                                                                                                                                                       1
I make notes within my analysis code                                                                                                                                             1
Comments in R                                                                                                                                                                    1
I keep notes in the programming tools (e.g. SPSS, R)                                                                                                                             1
Keep copies of my SPSS syntax.                                                                                                                                                   1
In line comments in syntax                                                                                                                                                       1
I both create additional pages in the files for notes and code the analyses/checks and save that code for documentation with notes in the code                                   1
Annotations in analysis script                                                                                                                                                   1
Notes within scripts                                                                                                                                                             1
Excel "data books"                                                                                                                                                               1
SPSS syntax                                                                                                                                                                      1
I document within stat software code                                                                                                                                             1
Remarks in analysis code                                                                                                                                                         1
comments in my R script                                                                                                                                                          1
I write notes in the same R script that I write the code for my analyses in. I also keep an Excel workbook to which I copy the analysis output from R.                           1
Record in the syntax files of my analysis                                                                                                                                        1
Document workflow using a standardized document, that asks users to document each stave of the analysis workflow.                                                                1
versions of syntax                                                                                                                                                               1
I comment directly in my code.                                                                                                                                                   1
I extensively annotate my syntax files to include decisions, reasoning, meetings, etc.                                                                                           1
I keep notes in R                                                                                                                                                                1
Name: analyze_docs_other_text, dtype: int64

43. Does everyone in the lab use similar system(s) for documenting their activities during the data analysis phase of a project?

In [70]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_analyze_docs_lab = pd.DataFrame({"Number": df["analyze_docs_lab"].value_counts(),
                                    "Percentage": df["analyze_docs_lab"].value_counts(normalize=True)*100})

df_analyze_docs_lab
Out[70]:
Number Percentage
No 146 63.478261
Yes 57 24.782609
Not applicable 27 11.739130

44. Do you believe someone with a similar level of expertise could recreate your data analysis steps from the documentation and notes your create as you are analyzing your data (without you being present)?

In [71]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_analyze_docs_others = pd.DataFrame({"Number": df["analyze_docs_others"].value_counts(),
                                       "Percentage": df["analyze_docs_others"].value_counts(normalize=True)*100})

df_analyze_docs_others
Out[71]:
Number Percentage
Yes. Someone could recreate both my data cleaning/coding and analysis steps. 137 58.798283
Someone could recreate my data analysis but not cleaning/coding steps. 42 18.025751
No. I would have to be present. 31 13.304721
Im not sure. 13 5.579399
Someone could recreate my cleaning/coding but not analysis steps. 9 3.862661
No. Another researcher could not do this, even if I were present. 1 0.429185

45. On a scale of 1 (No Need) to 5 (High level of need), please indicate your level of need for training or education for each of the following

In [72]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

analyze_education_variables={"analyze_education_methods":"Research design/methods",
                             "analyze_education_tools":"Use of software tools", 
                             "analyze_education_organizing":"Organizing data for analysis", 
                             "analyze_education_documenting":"Documenting analysis decisions", 
                             "analyze_education_sharing":"Sharing code in a useful manner"}
In [73]:
#Create dataframe containing descriptive statistics. 

#They're interesting and informative, but remember responses are ordinal.

df_analyze_education_cont= df[list(analyze_education_variables.keys())].describe().T
df_analyze_education_cont["median"]= df[list(analyze_education_variables.keys())].median().T
df_analyze_education_cont.set_index([list(analyze_education_variables.values())], inplace=True)
df_analyze_education_cont
Out[73]:
count mean std min 25% 50% 75% max median
Research design/methods 233.0 2.115880 1.082524 1.0 1.0 2.0 3.0 5.0 2.0
Use of software tools 232.0 2.698276 1.200432 1.0 2.0 3.0 4.0 5.0 3.0
Organizing data for analysis 232.0 2.487069 1.176908 1.0 2.0 2.0 3.0 5.0 2.0
Documenting analysis decisions 233.0 3.094421 1.242083 1.0 2.0 3.0 4.0 5.0 3.0
Sharing code in a useful manner 232.0 3.241379 1.231764 1.0 2.0 3.0 4.0 5.0 3.0
In [74]:
# Create dataframe displaying the percentage of participants who entered each value.

df_analyze_education_cat = df[list(analyze_education_variables.keys())]
df_analyze_education_cat = df_analyze_education_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_analyze_education_cat = df_analyze_education_cat.T
df_analyze_education_cat.set_index([list(analyze_education_variables.values())], inplace=True)
df_analyze_education_cat
Out[74]:
1.0 2.0 3.0 4.0 5.0
Research design/methods 33.047210 37.768240 18.454936 6.008584 4.721030
Use of software tools 18.965517 26.293103 28.879310 17.672414 8.189655
Organizing data for analysis 24.137931 28.448276 28.879310 11.637931 6.896552
Documenting analysis decisions 11.587983 23.605150 22.746781 27.896996 14.163090
Sharing code in a useful manner 10.344828 17.672414 27.155172 27.155172 17.672414
In [75]:
#Create the figure

with sns.color_palette("Greens"):
    bar_analyze_collect_stacked = df_analyze_education_cat.plot(kind='barh', stacked=True, legend=False)

#Clean up the formatting
bar_analyze_collect_stacked.invert_yaxis()
bar_analyze_collect_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
sns.despine(offset=10)

plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.axvline(25, color="k", linestyle="--");
plt.axvline(50, color="k", linestyle="--");
plt.axvline(75, color="k", linestyle="--");

Data Publishing/Sharing

Description The questions in this section concern activities and practices related to the communication or publication of your research results in a presentation or scholarly publication or the sharing of your data via a general or discipline-specific repository (e.g. Figshare, Dryad, Zenodo, ICPSR).

Back to table of contents


46. On a scale of 1-5, how would you rate the maturity of your data sharing activities?

In [76]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["share_mature_self"].median()))
print(df["share_mature_self"].describe())
print(df["share_mature_self"].value_counts())
median    3.0
count    229.000000
mean       2.737991
std        1.225245
min        1.000000
25%        2.000000
50%        3.000000
75%        4.000000
max        5.000000
Name: share_mature_self, dtype: float64
2.0    59
3.0    57
4.0    51
1.0    44
5.0    18
Name: share_mature_self, dtype: int64

47. On a scale of 1-5, how would you rate the maturity of the field of psychology as a whole in regards to data sharing?

In [77]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["share_mature_field"].median()))
print(df["share_mature_field"].describe())
print(df["share_mature_field"].value_counts())
median    2.0
count    225.000000
mean       2.297778
std        0.928346
min        1.000000
25%        2.000000
50%        2.000000
75%        3.000000
max        5.000000
Name: share_mature_field, dtype: float64
2.0    88
3.0    68
1.0    47
4.0    20
5.0     2
Name: share_mature_field, dtype: int64

48. On a scale of 1-5, how would you rate your willingness to change your data management practices during the data collection phase in response to new technologies, opportunities, or requirements?

In [78]:
#Print descriptive statistics. Interesting and informative, but remember responses are ordinal.

print("median   ", str(df["share_change"].median()))
print(df["share_change"].describe())
print(df["share_change"].value_counts())
median    4.0
count    229.000000
mean       4.104803
std        0.897083
min        1.000000
25%        4.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: share_change, dtype: float64
5.0    89
4.0    89
3.0    39
2.0    10
1.0     2
Name: share_change, dtype: int64

49. Is there any reason part or all of your data cannot be shared? [Select all that apply]

In [79]:
share_cant_variables = {"share_cant_publish":"Yes. My data contains additional findings I wish to discover/publish",
                        "share_cant_sensitive":"Yes, my data contains confidential or sensitive information.",
                        "share_cant_irb":"Yes, I have not received institutional review board approval to share my data.",
                        "share_cant_format":"Yes, my data is in a format that makes it difficult to share with others.",
                        "share_cant_ip":"Yes, my data is proprietary or subject to intellectual property concerns",
                        "share_cant_pi":"Yes, my supervisor/collaborators do not wish to share the data.",
                        "share_cant_time":"Yes, it would take too much time or effort for me to share my data.",
                        "share_cant_knowledge":"Yes, I do not know how to share my data.",
                        "share_cant_other":"Yes. Other ",
                        "share_cant_no_authorship":"No, but I request authorship if others use my data",
                        "share_cant_no_citation":"No, but I request citation or acknowledgement if others use my data."}
                        
df_share_cant = pd.DataFrame({"Number": df[list(share_cant_variables.keys())].count(),
                              "Percentage": df[list(share_cant_variables.keys())].count()/df["share_cant"].sum()*100})

df_share_cant.set_index([list(share_cant_variables.values())], inplace=True)
df_share_cant
Out[79]:
Number Percentage
Yes. My data contains additional findings I wish to discover/publish 90 40.358744
Yes, my data contains confidential or sensitive information. 113 50.672646
Yes, I have not received institutional review board approval to share my data. 70 31.390135
Yes, my data is in a format that makes it difficult to share with others. 30 13.452915
Yes, my data is proprietary or subject to intellectual property concerns 14 6.278027
Yes, my supervisor/collaborators do not wish to share the data. 51 22.869955
Yes, it would take too much time or effort for me to share my data. 50 22.421525
Yes, I do not know how to share my data. 18 8.071749
Yes. Other 5 2.242152
No, but I request authorship if others use my data 19 8.520179
No, but I request citation or acknowledgement if others use my data. 52 23.318386
In [80]:
#Display text entered by participants who selected "other".

df["share_cant_other_text"].value_counts()
Out[80]:
I do not share the raw audio files on OSF because it would be cumbersome; I instead code them. All other data is shared on OSF. If somoene would want the audio files I would share them (e.g. posting a USB stick or simply emailing a zip drive if it's not too large).    1
I share data relevant to a paper. If for example there are variables that are part of a dataset but have never been analyzed, I do not necessarily share them.                                                                                                               1
Data collected with tribal agreement requires agreement to share. I still haven't figured out how this will be handled by my funding agency. They are vague on this point.                                                                                                   1
Not sure if anyone would be interested                                                                                                                                                                                                                                       1
sometimes is is proprietary... but mostly no                                                                                                                                                                                                                                 1
Name: share_cant_other_text, dtype: int64

50. Have you ever archived, deposited, or published a dataset in order to make it available to others?

In [81]:
share_archive_variables={"share_archive_article":"Yes, I have published my data as part of a journal article",
                         "share_archive_repo_government":"Yes, I have deposited my data in a government or funder sponsored repository",
                         "share_archive_OSF":"Yes, I have shared my data using the Open Science Framework (OSF)",
                         "share_archive_repo_other":"Yes, I have deposited my data into a general purpose repository besides the OSF",
                         "share_archive_repo_disclipline":"Yes, I have deposited my data in a discipline-specific repository",
                         "share_archive_repo_ir":"Yes, I have deposited my data in my institutional repository",
                         "share_archive_other":"Yes. Other",
                         "share_archive_none":"No"}

df_share_archive = pd.DataFrame({"Number": df[list(share_archive_variables.keys())].count(),
                              "Percentage": df[list(share_archive_variables.keys())].count()/df["share_archive"].sum()*100})

df_share_archive.set_index([list(share_archive_variables.values())], inplace=True)
df_share_archive
Out[81]:
Number Percentage
Yes, I have published my data as part of a journal article 76 33.187773
Yes, I have deposited my data in a government or funder sponsored repository 15 6.550218
Yes, I have shared my data using the Open Science Framework (OSF) 102 44.541485
Yes, I have deposited my data into a general purpose repository besides the OSF 15 6.550218
Yes, I have deposited my data in a discipline-specific repository 10 4.366812
Yes, I have deposited my data in my institutional repository 7 3.056769
Yes. Other 10 4.366812
No 85 37.117904
In [82]:
#Display text entered by participants who selected "other".

df["share_archive_other_text"].value_counts()
Out[82]:
I sometimes allow undergraduates to use my data for practice.                                          1
codeocean.com                                                                                          1
I have published a partial dataset as part of supplementary materials                                  1
I have sent data sets to researchers who requested them                                                1
I shared data upon request                                                                             1
GitHub                                                                                                 1
Currently under review for a paper sharing the data as part of the supplementary materials             1
Provided datasets via Github                                                                           1
delivered my data in email as SPSS file upon request by7 other investigators outside my institution    1
Deposited in lab specific website                                                                      1
Name: share_archive_other_text, dtype: int64

51. If applicable, what is your motivation for sharing your data (such as through uploading some or all of it to a general or discipline-specific repository)?

In [83]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

share_motivation_variables={"share_motivation_results":"To communicate my results and/or add to the scholarly literature",
                            "share_motivation_validity":"To allow other researchers to assess the validity of my conclusions",
                            "share_motivation_incentives":"Professional incentives",
                            "share_motivation_ip":"To establish intellectual property or patent claims.",
                            "share_motivation_mandate":"It is mandated by a funder, publisher, or my institution.",
                            "share_motivation_transparency":"To foster transparency and reproducibility.",
                            "share_motivation_reuse":"To foster re-use",
                            "share_motivation_other":"Other",
                            "share_motivation_na":"Not applicable, I do not share my data in this manner."}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_share_motivation = pd.DataFrame({"Number": df[list(share_motivation_variables.keys())].count(),
                                    "Percentage": df[list(share_motivation_variables.keys())].count()/df["share_motivation"].sum()*100})

df_share_motivation.set_index([list(share_motivation_variables.values())], inplace=True)
df_share_motivation
Out[83]:
Number Percentage
To communicate my results and/or add to the scholarly literature 128 56.387665
To allow other researchers to assess the validity of my conclusions 135 59.471366
Professional incentives 42 18.502203
To establish intellectual property or patent claims. 3 1.321586
It is mandated by a funder, publisher, or my institution. 62 27.312775
To foster transparency and reproducibility. 153 67.400881
To foster re-use 126 55.506608
Other 9 3.964758
Not applicable, I do not share my data in this manner. 46 20.264317
In [84]:
#Display text entered by participants who selected "other".

df["share_motivation_other_text"].value_counts()
Out[84]:
Teaching.                                                                                    1
To foster discovery by other groups that were not indended at the time of data collection    1
So that it can be used by junior scientists who don't have their own funding                 1
I have not yet shared my data but plan to in the future for the selected reasons.            1
Backup                                                                                       1
to allow others to address questions that my data might address.                             1
To establish an open-science track record for potential future jobs                          1
It's the right thing to do.                                                                  1
I share to the reviewer of the paper, upon request                                           1
Name: share_motivation_other_text, dtype: int64

52. Have you ever published in a journal that required you to share data or complete a data availability statement upon publication of your article?

In [85]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_share_publisher = pd.DataFrame({"Number": df["share_publisher"].value_counts(),
                                   "Percentage": df["share_publisher"].value_counts(normalize=True)*100})

df_share_publisher
Out[85]:
Number Percentage
No 93 40.434783
Yes, I have been required to complete a data availability statement. 66 28.695652
Yes, I have been required to both share data and complete a data availability statement. 43 18.695652
Yes, I have been required to share data.\t 15 6.521739
Im not sure. 13 5.652174

53. Have you ever requested data associated with a paper or other scholarly publication from another researcher?

In [86]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_share_request = pd.DataFrame({"Number": df["share_request"].value_counts(),
                                 "Percentage": df["share_request"].value_counts(normalize=True)*100})

df_share_request
Out[86]:
Number Percentage
No 135 58.695652
Yes 91 39.565217
I don't know. 4 1.739130

On a scale of 1 to 5 how often have you received such data in a form that was usable without a significant amount of effort?

In [87]:
print("median   ", str(df["share_request_usable"].median()))
print(df["share_request_usable"].describe())
print(df["share_request_usable"].value_counts())
median    3.0
count    90.000000
mean      2.944444
std       1.115379
min       1.000000
25%       2.000000
50%       3.000000
75%       4.000000
max       5.000000
Name: share_request_usable, dtype: float64
3.0    36
4.0    18
2.0    17
1.0    11
5.0     8
Name: share_request_usable, dtype: int64

54. Have you ever received a request for data associated with one of your papers or scholarly publications?

In [88]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_share_receive = pd.DataFrame({"Number": df["share_receive"].value_counts(),
                                 "Percentage": df["share_receive"].value_counts(normalize=True)*100})

df_share_receive
Out[88]:
Number Percentage
No 120 51.948052
Yes 110 47.619048
I don't know. 1 0.432900

On a scale of 1 to 5 how often have you been able to send the requested data in a usable form without a significant amount of effort?

In [89]:
print("median   ", str(df["share_receive_usable"].median()))
print(df["share_receive_usable"].describe())
print(df["share_receive_usable"].value_counts())
median    4.0
count    110.000000
mean       3.590909
std        1.228824
min        1.000000
25%        3.000000
50%        4.000000
75%        5.000000
max        5.000000
Name: share_receive_usable, dtype: float64
5.0    32
4.0    30
3.0    27
2.0    13
1.0     8
Name: share_receive_usable, dtype: int64

55. If you have requested data from another researcher or sought openly accessible data, what did you use it for?

In [90]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

share_use_variables={"share_use_replicate":"To verify or replicate their results.",
                     "share_use_meta":"As part of completing a meta-analysis.",
                     "share_use_extend":"To extend conclusions drawn from it or test alternative hypotheses.",
                     "share_use_test":"To learn a new technique, method, or tool.",
                     "share_use_none":"I did not end up using it.",
                     "share_use_other":"Other",
                     "share_use_na":"Not applicable."}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_share_use = pd.DataFrame({"Number": df[list(share_use_variables.keys())].count(),
                             "Percentage": df[list(share_use_variables.keys())].count()/df["share_use"].sum()*100})

df_share_use.set_index([list(share_use_variables.values())], inplace=True)
df_share_use
Out[90]:
Number Percentage
To verify or replicate their results. 31 13.596491
As part of completing a meta-analysis. 47 20.614035
To extend conclusions drawn from it or test alternative hypotheses. 54 23.684211
To learn a new technique, method, or tool. 25 10.964912
I did not end up using it. 14 6.140351
Other 14 6.140351
Not applicable. 117 51.315789
In [91]:
#Display text entered by participants who selected "other".

df["share_use_other_text"].value_counts()
Out[91]:
Teaching                                                                                          2
To obtain descriptives they did not publish                                                       1
For discovery and testing novel hypotheses                                                        1
to apply different analyses to the data set                                                       1
Original analyses                                                                                 1
Scale refinement                                                                                  1
Requested norms to select stimulus materials                                                      1
To recreate a figure for reuse in my publication.                                                 1
As part of peer review                                                                            1
To use as a teaching example                                                                      1
Test my models on other data                                                                      1
To check how certain measures were collected (e.g. continuous data made into categorical data)    1
New studies                                                                                       1
Name: share_use_other_text, dtype: int64

56. In general, do you believe that someone with a similar level of expertise could recreate your analysis steps using your description of them in a publication or scholarly report (without you being present)?

In [92]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_share_docs_others = pd.DataFrame({"Number": df["share_docs_others"].value_counts(),
                                     "Percentage": df["share_docs_others"].value_counts(normalize=True)*100})

df_share_docs_others
Out[92]:
Number Percentage
Yes, someone could recreate both my data cleaning/coding and analysis steps. 122 53.508772
Someone could recreate my data analysis but not cleaning/coding steps. 68 29.824561
I am not sure. 19 8.333333
No, I would have to be present. 17 7.456140
Someone could recreate my cleaning/coding but not analysis steps. 2 0.877193

57. How long do you (or your lab) typically keep a dataset?

In [93]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_share_preserve_time = pd.DataFrame({"Number": df["share_preserve_time"].value_counts(),
                                     "Percentage": df["share_preserve_time"].value_counts(normalize=True)*100})

df_share_preserve_time
Out[93]:
Number Percentage
8+ years (but in a formats that it may become obsolete) 98 42.982456
8+ years (and maintained so it is always accessible) 57 25.000000
4-8 years after the conclusion of a project 27 11.842105
I dont know. 26 11.403509
Other (Please describe) 11 4.824561
1-3 years after the conclusion of a project 8 3.508772
Less than a year 1 0.438596
In [94]:
#Display text entered by participants who selected "other".

df["share_preserve_time_text"].value_counts()
Out[94]:
I don't destroy data                                                                                                                                                                                                                                                                   1
no limit                                                                                                                                                                                                                                                                               1
All the data are on OSF. So I guess indefinitely?                                                                                                                                                                                                                                      1
I have data from all projects and intend to keep                                                                                                                                                                                                                                       1
Our lab is very young -- we've kept all our datasets hitertho, but have not discussed for how long (it's implied that it will be for a long time, but we're unsure how to define that)                                                                                                 1
None of our projects have concluded yet.                                                                                                                                                                                                                                               1
my lab is only 3 years old and we are just completing the first project                                                                                                                                                                                                                1
Saved in institutional repository, have no control over how long they keep it (also switched institutions in the meantime...)                                                                                                                                                          1
My intention is to keep all data indefinitely, but I haven't ever thought about formats becoming obsolete, so I haven't considered maintenance in detail. Perhaps that isn't necessary when my data is usually saved as .csv files, but thanks for bringing that to my attention!      1
Infinitely                                                                                                                                                                                                                                                                             1
Too early in my career to be sure - but data are on the OSF in .csv format.                                                                                                                                                                                                            1
Name: share_preserve_time_text, dtype: int64

58. What are the important components of your research to preserve long term (after the conclusion of a project)?

In [95]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

share_preserve_variables={"share_preserve_what_av":"Audio/visual recordings",
                          "share_preserve_what_demographic":"Demographic data ",
                          "share_preserve_what_medical":"Clinical or Medical data",
                          "share_preserve_what_scales_quantitative":"Quantitative data from questionnaires",
                          "share_preserve_what_scales_qualitative":"Qualitative data from questionnaires",
                          "share_preserve_what_behavioral":"Behavioral data",
                          "share_preserve_what_qualitative":"Qualitative data",
                          "share_preserve_what_neurphysiological":"Data from neuropsychological or aptitude tests ",
                          "share_preserve_what_neuroimaging":"Neuroimaging data",
                          "share_preserve_what_writing":"Data from written documents",
                          "share_preserve_what_physiological":"Physiological data",
                          "share_preserve_what_genetic":"Genetic/molecular data",
                          "share_preserve_what_eye_tracking":"Eye tracking/pupillometry data ",
                          "share_preserve_what_session":"Information about the data collection session ",
                          "share_preserve_what_paradigm":"Task-related information",
                          "share_preserve_what_stimuli":"Task-related stimuli",
                          "share_preserve_what_code_collection":"Computer code used for data collection ",
                          "share_preserve_what_schemes":"Coding materials",
                          "share_preserve_what_consent":"Informed consent-related documentation",
                          "share_preserve_what_other":"Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_share_preserve = pd.DataFrame({"Number": df[list(share_preserve_variables.keys())].count(),
                                  "Percentage": df[list(share_preserve_variables.keys())].count()/df["share_preserve_what"].sum()*100})

df_share_preserve.set_index([list(share_preserve_variables.values())], inplace=True)
df_share_preserve
Out[95]:
Number Percentage
Audio/visual recordings 56 24.778761
Demographic data 142 62.831858
Clinical or Medical data 46 20.353982
Quantitative data from questionnaires 172 76.106195
Qualitative data from questionnaires 51 22.566372
Behavioral data 145 64.159292
Qualitative data 23 10.176991
Data from neuropsychological or aptitude tests 35 15.486726
Neuroimaging data 56 24.778761
Data from written documents 19 8.407080
Physiological data 38 16.814159
Genetic/molecular data 14 6.194690
Eye tracking/pupillometry data 47 20.796460
Information about the data collection session 82 36.283186
Task-related information 85 37.610619
Task-related stimuli 101 44.690265
Computer code used for data collection 109 48.230088
Coding materials 110 48.672566
Informed consent-related documentation 90 39.823009
Other 12 5.309735
In [96]:
#Display text entered by participants who selected "other".

df["share_preserve_what_other_text"].value_counts()
Out[96]:
Motion capture data                                                                                                1
I would save any data collected/created, except informed consent which I typically destroy within 5 or so years    1
the scales / questions that produced the data (e.g., the survey protocol)                                          1
All that is collected?                                                                                             1
Code/output from statistical analyses                                                                              1
I don't understand this question?                                                                                  1
see reply above                                                                                                    1
Name: share_preserve_what_other_text, dtype: int64

59. On a scale of 1 (No Need) to 5 (High level of need), please indicate your level of need for training or education for each of the following

In [97]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

sharing_education_variables={"share_education_useful":"Sharing data in a form that ensures it will be useful to others",
                             "share_education_platforms":"Use of different platforms, repositories, or tools for sharing data ",
                             "share_education_confidentiality":"Protecting participant confidentiality in shared data.",
                             "share_education_reuse":"Understanding reuse rights related to data",
                             "share_education_archiving":"Best practices for preserving or archiving data over the long term."}
In [98]:
#Create dataframe containing descriptive statistics. 

#They're interesting and informative, but remember responses are ordinal.

df_sharing_education_cont= df[list(sharing_education_variables.keys())].describe().T
df_sharing_education_cont["median"]= df[list(sharing_education_variables.keys())].median().T
df_sharing_education_cont.set_index([list(sharing_education_variables.values())], inplace=True)
df_sharing_education_cont
Out[98]:
count mean std min 25% 50% 75% max median
Sharing data in a form that ensures it will be useful to others 228.0 3.280702 1.202118 1.0 2.0 3.0 4.0 5.0 3.0
Use of different platforms, repositories, or tools for sharing data 227.0 3.215859 1.269839 1.0 2.0 3.0 4.0 5.0 3.0
Protecting participant confidentiality in shared data. 228.0 2.929825 1.415579 1.0 2.0 3.0 4.0 5.0 3.0
Understanding reuse rights related to data 227.0 3.436123 1.286113 1.0 2.0 4.0 4.0 5.0 4.0
Best practices for preserving or archiving data over the long term. 228.0 3.675439 1.234565 1.0 3.0 4.0 5.0 5.0 4.0
In [99]:
df_sharing_education_cat = df[list(sharing_education_variables.keys())]
df_sharing_education_cat = df_sharing_education_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_sharing_education_cat = df_sharing_education_cat.T
df_sharing_education_cat.set_index([list(sharing_education_variables.values())], inplace=True)
df_sharing_education_cat
Out[99]:
1.0 2.0 3.0 4.0 5.0
Sharing data in a form that ensures it will be useful to others 10.087719 15.789474 25.877193 32.456140 15.789474
Use of different platforms, repositories, or tools for sharing data 11.894273 18.502203 23.348018 28.634361 17.621145
Protecting participant confidentiality in shared data. 20.614035 23.245614 17.105263 20.614035 18.421053
Understanding reuse rights related to data 8.810573 18.942731 16.740088 30.837004 24.669604
Best practices for preserving or archiving data over the long term. 7.456140 10.964912 19.736842 30.263158 31.578947
In [100]:
#Create a stacked bar chart to display the data.

with sns.color_palette("Greens"):
    bar_education_collect_stacked = df_sharing_education_cat.plot(kind='barh', stacked=True, legend=False)

#Clean up the formatting

bar_education_collect_stacked.invert_yaxis()
bar_education_collect_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
sns.despine(offset=10)
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.axvline(25, color="k", linestyle="--");
plt.axvline(50, color="k", linestyle="--");
plt.axvline(75, color="k", linestyle="--");

Emerging Publication Practices

Description: The questions in this section concern activities, practices, and plans related to new (or newly visible) ways of communicating, disseminating, or sharing material related to your research.

Back to table of contents


60. What is your motivation for publishing a scholarly article that describes conclusions drawn from your data (such as a peer-reviewed journal article)?

In [101]:
#Participants could select multiple responses for this question.

#Create dictionary containing responses.

sc_publish_variables={"sc_publish_communicate":"To communicate my results and/or add to the scholarly literature",
                      "sc_publish_validity":"To allow other researchers to assess the validity of my conclusions.",
                      "sc_publish_incentives":"Professional incentives (e.g. authorship or citations are required for promotion)",
                      "sc_publish_ip":"To establish intellectual property or patent claims.",
                      "sc_publish_funder":"It is expected by my funding agency.",
                      "sc_publish_employer":"It is expected by my employer.",
                      "sc_publish_other":"Other"}

#Create dataframe containing both the number and percentage of responding participants who selected each response.

df_sc_publish = pd.DataFrame({"Number": df[list(sc_publish_variables.keys())].count(),
                              "Percentage": df[list(sc_publish_variables.keys())].count()/df["share_use"].sum()*100})

df_sc_publish.set_index([list(sc_publish_variables.values())], inplace=True)
df_sc_publish
Out[101]:
Number Percentage
To communicate my results and/or add to the scholarly literature 224 98.245614
To allow other researchers to assess the validity of my conclusions. 156 68.421053
Professional incentives (e.g. authorship or citations are required for promotion) 190 83.333333
To establish intellectual property or patent claims. 19 8.333333
It is expected by my funding agency. 92 40.350877
It is expected by my employer. 135 59.210526
Other 4 1.754386

61. Are you limited in addressing your research questions by a lack of access to research data collected by others

In [102]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_sc_limited = pd.DataFrame({"Number": df["sc_limited"].value_counts(),
                              "Percentage": df["sc_limited"].value_counts(normalize=True)*100})

df_sc_limited
Out[102]:
Number Percentage
No 136 59.911894
Yes 70 30.837004
I dont know 21 9.251101

62. Do you consider data to be a “first class” research product?

In [103]:
#Participants could only give one response to this question.

#Create dataframe containing both number of responses and normalized responses (percentage). 

df_sc_firstclass = pd.DataFrame({"Number": df["sc_firstclass"].value_counts(),
                                 "Percentage": df["sc_firstclass"].value_counts(normalize=True)*100})

df_sc_firstclass
Out[103]:
Number Percentage
Yes 106 46.696035
I dont know 63 27.753304
No 58 25.550661

63. Please indicate if you are currently doing any of the following activities as well as if you and if you plan do any in the future.

In [104]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.


sc_activities_current={"sc_activities_current_preprint":"Publish a preprint",
                       "sc_activities_current_oa_gold":"Publish in an open access journal",
                       "sc_activities_current_oa_green":"Deposit an author’s accepted manuscript",
                       "sc_activities_current_cite_data":"Cite a dataset",
                       "sc_activities_current_data_paper":"Cite code or software",
                       "sc_activities_current_cite_code":"Publish a data paper or publish a dataset",
                       "sc_activities_current_data_mediated":"Make data available, but only to researchers with appropriate credentials",
                       "sc_activities_current_share_materials":"Share or publish other research materials",
                       "sc_activities_current_share_protocol":"Share or publish a study protocol",
                       "sc_activities_current_preregister":"Pre-register a study",
                       "sc_activities_current_register_report":"Submitting a registered report",
                       "sc_activities_current_curation":"Take advantage of a data curation or research data management service",
                       "sc_activities_current_replication":"Publish a direct replication of a previously published study"}
                      
#Create dataframe containing both the number and percentage of responding participants who selected each response.
    
df2 = df[list(sc_activities_current.keys())]
df2 = df2.apply(lambda x: pd.value_counts(x, normalize=True))*100
df2 = df2.T
df2.set_index([list(sc_activities_current.values())], inplace=True)
df2
Out[104]:
I don't know No Yes
Publish a preprint 3.125000 54.017857 42.857143
Publish in an open access journal 2.714932 34.389140 62.895928
Deposit an author’s accepted manuscript 6.278027 40.358744 53.363229
Cite a dataset 4.910714 64.732143 30.357143
Cite code or software 4.444444 85.333333 10.222222
Publish a data paper or publish a dataset 3.125000 35.267857 61.607143
Make data available, but only to researchers with appropriate credentials 8.520179 70.852018 20.627803
Share or publish other research materials 5.357143 51.339286 43.303571
Share or publish a study protocol 2.666667 68.444444 28.888889
Pre-register a study 2.242152 45.291480 52.466368
Submitting a registered report 7.142857 77.678571 15.178571
Take advantage of a data curation or research data management service 7.623318 77.578475 14.798206
Publish a direct replication of a previously published study 3.571429 73.214286 23.214286
In [105]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

sc_activities_future={"sc_activities_future_preprint":"Publish a preprint",
                       "sc_activities_future_oa_gold":"Publish in an open access journal",
                       "sc_activities_future_oa_green":"Deposit an author’s accepted manuscript",
                       "sc_activities_future_cite_data":"Cite a dataset",
                       "sc_activities_future_data_paper":"Cite code or software",
                       "sc_activities_future_cite_code":"Publish a data paper or publish a dataset",
                       "sc_activities_future_data_mediated":"Make data available, but only to researchers with appropriate credentials",
                       "sc_activities_future_share_materials":"Share or publish other research materials",
                       "sc_activities_future_share_protocol":"Share or publish a study protocol",
                       "sc_activities_future_preregister":"Pre-register a study",
                       "sc_activities_future_register_report":"Submitting a registered report",
                       "sc_activities_future_curation":"Take advantage of a data curation or research data management service",
                       "sc_activities_future_replication":"Publish a direct replication of a previously published study"}
                      
#Create dataframe containing both the number and percentage of responding participants who selected each response.
   
df3 = df[list(sc_activities_future.keys())]
df3 = df3.apply(lambda x: pd.value_counts(x, normalize=True))*100
df3 = df3.T
df3.set_index([list(sc_activities_future.values())], inplace=True)
df3
Out[105]:
I don't know I dont know No Yes
Publish a preprint 27.027027 NaN 9.459459 63.513514
Publish in an open access journal 12.669683 NaN 3.167421 84.162896
Deposit an author’s accepted manuscript NaN 23.423423 10.360360 66.216216
Cite a dataset NaN 43.303571 6.250000 50.446429
Cite code or software NaN 45.982143 27.678571 26.339286
Publish a data paper or publish a dataset NaN 24.200913 4.566210 71.232877
Make data available, but only to researchers with appropriate credentials NaN 40.625000 25.000000 34.375000
Share or publish other research materials NaN 28.888889 11.555556 59.555556
Share or publish a study protocol NaN 33.035714 18.750000 48.214286
Pre-register a study NaN 20.888889 5.777778 73.333333
Submitting a registered report NaN 35.426009 9.417040 55.156951
Take advantage of a data curation or research data management service NaN 53.571429 16.964286 29.464286
Publish a direct replication of a previously published study NaN 42.857143 12.500000 44.642857

64. On a scale of 1 (No Need) to 5 (High level of need), please indicate your level of need for training or education for each of the following:

In [106]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

sc_education_variables={"sc_education_oa":"Open access publishing",
                        "sc_education_preregister":"Preregistering studies",
                        "sc_education_tools":"Using open science software tools",
                        "sc_education_practices":"Applying open science practices",
                        "sc_education_datasets":"Finding and using openly available datasets"}
In [107]:
#Create dataframe containing descriptive statistics. 

#They're interesting and informative, but remember responses are ordinal.

df_sharing_education_cont= df[list(sc_education_variables.keys())].describe().T
df_sharing_education_cont["median"]= df[list(sc_education_variables.keys())].median().T
df_sharing_education_cont.set_index([list(sc_education_variables.values())], inplace=True)
df_sharing_education_cont
Out[107]:
count mean std min 25% 50% 75% max median
Open access publishing 227.0 2.555066 1.350290 1.0 1.0 2.0 4.0 5.0 2.0
Preregistering studies 227.0 2.801762 1.289987 1.0 2.0 3.0 4.0 5.0 3.0
Using open science software tools 226.0 3.079646 1.360664 1.0 2.0 3.0 4.0 5.0 3.0
Applying open science practices 227.0 3.048458 1.307681 1.0 2.0 3.0 4.0 5.0 3.0
Finding and using openly available datasets 226.0 3.203540 1.347652 1.0 2.0 3.0 4.0 5.0 3.0
In [108]:
# Create dataframe displaying the percentage of participants who entered each value.

df_sc_education_cat = df[list(sc_education_variables.keys())]
df_sc_education_cat = df_sc_education_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_sc_education_cat = df_sc_education_cat.T
df_sc_education_cat.set_index([list(sc_education_variables.values())], inplace=True)
df_sc_education_cat
Out[108]:
1.0 2.0 3.0 4.0 5.0
Open access publishing 31.718062 18.942731 20.264317 20.264317 8.810573
Preregistering studies 21.145374 20.704846 25.550661 22.026432 10.572687
Using open science software tools 17.699115 18.141593 19.911504 26.991150 17.256637
Applying open science practices 15.418502 19.823789 25.991189 22.026432 16.740088
Finding and using openly available datasets 14.159292 19.026549 19.911504 26.106195 20.796460
In [109]:
#Create a stacked bar chart to display the data.

with sns.color_palette("Greens"):
    sc_stacked = df_sharing_education_cat.plot(kind='barh', stacked=True, legend=False)

#Clean up the formatting
sc_stacked.invert_yaxis()
sc_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
sns.despine(offset=10)

plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.axvline(25, color="k", linestyle="--");
plt.axvline(50, color="k", linestyle="--");
plt.axvline(75, color="k", linestyle="--");

Comparisons and Analyses

Back to table of contents


Data Collection - Self vs Field Ratings

In [110]:
stats.mannwhitneyu(df['collect_mature_self'],df['collect_mature_field'])
Out[110]:
MannwhitneyuResult(statistic=27770.5, pvalue=2.8968261389290187e-08)

Data Analysis - Self vs Field Ratings

In [111]:
stats.mannwhitneyu(df['analyze_mature_self'],df['analyze_mature_field'])
Out[111]:
MannwhitneyuResult(statistic=29030.0, pvalue=1.1899933385738388e-06)

Data Sharing - Self vs Field Ratings

In [112]:
stats.mannwhitneyu(df['share_mature_self'],df['share_mature_field'])
Out[112]:
MannwhitneyuResult(statistic=33280.0, pvalue=0.009586549013812413)

Compare Self Ratings

In [113]:
stats.kruskal(df['collect_mature_self'].dropna(),df['analyze_mature_self'].dropna(),df['share_mature_self'].dropna())
Out[113]:
KruskalResult(statistic=39.238094895847624, pvalue=3.016865531824124e-09)

Compare Field Ratings

In [114]:
stats.kruskal(df['collect_mature_field'].dropna(),df['analyze_mature_field'].dropna(),df['share_mature_field'].dropna())
Out[114]:
KruskalResult(statistic=22.21091229208808, pvalue=1.5030093308864582e-05)

Compare Readiness to Change Ratings

In [115]:
stats.kruskal(df['collect_change'].dropna(),df['analyze_change'].dropna(),df['share_change'].dropna())
Out[115]:
KruskalResult(statistic=1.3619150672206677, pvalue=0.5061321217448382)

Compare Data Sent vs Recieved

In [116]:
stats.mannwhitneyu(df["share_receive_usable"],df["share_request_usable"])
Out[116]:
MannwhitneyuResult(statistic=28544.5, pvalue=5.893290691047199e-07)

Combined Data and Figures

Back to table of contents


Figure 1 - Limits and Motivations for Current Data Managment Practices

In [117]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

limits_variables = {"limits_time": "The amount of time it takes",
                    "limits_incentives": "Lack of professional incentives",
                    "limits_training": "Lack of training",
                    "limits_norms": "Lack of norms or best practices",
                    "limits_support": "Lack of institutional support",
                    "limits_knowldge": "I am unaware of best practices",
                    "limits_guidance": "Lack of guidance from my PI/collaborators",
                    "limits_data": "The characteristics of my data limit what I can do",
                    "limits_cost": "The financial cost",
                    "limits_pi": "Requirements of PI/collaborators"} 

# Create dataframe displaying the percentage of participants who entered each value.

df_limits_cat = df[list(limits_variables.keys())]
df_limits_cat = df_limits_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_limits_cat = df_limits_cat.T
df_limits_cat.set_index([list(limits_variables.values())], inplace=True)
In [118]:
#Create a stacked bar chart to display the data.

bar_limits_stacked = df_limits_cat.plot(kind='barh', stacked=True, legend=False, color=["#dad7cb", '#b6b1a9', "#8f1425","#928b81","#5f574f"])


    
bar_limits_stacked.invert_yaxis()
bar_limits_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
bar_limits_stacked.get_xaxis().set_ticks([0,25,50,75,100])
sns.despine(offset=10)
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.axvline(25, color="k", linestyle="--");
plt.axvline(50, color="k", linestyle="--");
plt.axvline(75, color="k", linestyle="--");

plt.savefig("Desktop/psych_limits.png", format='png', dpi=1000, bbox_inches="tight")
In [119]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

motivations_variables = {"motivations_loss": "Prevent loss of data",
                         "motivations_reproducibility": "Desire to foster reproducibility",
                         "motivations_transparency": "Desire to foster research transparency",
                         "motivations_continuity": "Ensure continuity as research team changes",
                         "motivations_compliance_ethics": "Compliance with legal/ethical frameworks",
                         "motivations_best_practice": "Awareness of best practices",
                         "motivations_compliance_funding": "Compliance with mandates from funder/publisher",
                         "motivations_guidance": "Availability of guidance of best practices",
                         "motivations_pi": "Guidance from PI/Collaborators",
                         "motivations_training": "Availability of training",
                         "motivations_support": "Institutional support"}

# Create dataframe displaying the percentage of participants who entered each value.

df_motivations_cat = df[list(motivations_variables.keys())]
df_motivations_cat = df_motivations_cat.apply(lambda x: pd.value_counts(x, normalize=True))*100
df_motivations_cat = df_motivations_cat.T
df_motivations_cat.set_index([list(motivations_variables.values())], inplace=True)
In [120]:
bar_motivations_stacked = df_motivations_cat.plot(kind='barh', stacked=True, legend=False, color=["#dad7cb", '#b6b1a9', "#14628f","#928b81","#5f574f"])

#Clean up the formatting

bar_motivations_stacked.invert_yaxis()
bar_motivations_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
bar_motivations_stacked.get_xaxis().set_ticks([0,25,50,75,100])
bar_motivations_stacked.get_xaxis().set_ticks([0,25,50,75,100])
sns.despine(offset=10)
plt.legend(bbox_to_anchor=(1, 1), loc=2
          )

plt.axvline(25, color="k", linestyle="--");
plt.axvline(50, color="k", linestyle="--");
plt.axvline(75, color="k", linestyle="--");

plt.savefig("Desktop/psych_motivations.png", format='png', dpi=1000, bbox_inches="tight")

Figure 1 - Limits and Motivations for Current Data Managment Practices

In [121]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.


sc_activities_current={"sc_activities_current_data_paper":"Publish a data paper or publish a dataset",
                       "sc_activities_current_register_report":"Submitting a registered report",
                       "sc_activities_current_curation":"Take advantage of a data curation or research data management service",
                       "sc_activities_current_replication":"Publish a direct replication of a previously published study",
                       "sc_activities_current_data_mediated":"Make data available, but only to researchers with appropriate credentials",
                       "sc_activities_current_share_protocol":"Share or publish a study protocol",
                       "sc_activities_current_cite_data":"Cite a dataset",
                       "sc_activities_current_preprint":"Publish a preprint",
                       "sc_activities_current_share_materials":"Share or publish other research materials",
                       "sc_activities_current_preregister":"Pre-register a study",
                       "sc_activities_current_oa_green":"Deposit an author’s accepted manuscript",
                       "sc_activities_current_cite_code":"Cite code or software",
                       "sc_activities_current_oa_gold":"Publish in an open access journal"}
#Create dataframe containing both the number and percentage of responding participants who selected each response.
    
    
    
df2 = df[list(sc_activities_current.keys())]
df2 = df2.apply(lambda x: pd.value_counts(x, normalize=True))*100
df2 = df2.T
df2.set_index([list(sc_activities_current.values())], inplace=True)
df2
Out[121]:
I don't know No Yes
Publish a data paper or publish a dataset 4.444444 85.333333 10.222222
Submitting a registered report 7.142857 77.678571 15.178571
Take advantage of a data curation or research data management service 7.623318 77.578475 14.798206
Publish a direct replication of a previously published study 3.571429 73.214286 23.214286
Make data available, but only to researchers with appropriate credentials 8.520179 70.852018 20.627803
Share or publish a study protocol 2.666667 68.444444 28.888889
Cite a dataset 4.910714 64.732143 30.357143
Publish a preprint 3.125000 54.017857 42.857143
Share or publish other research materials 5.357143 51.339286 43.303571
Pre-register a study 2.242152 45.291480 52.466368
Deposit an author’s accepted manuscript 6.278027 40.358744 53.363229
Cite code or software 3.125000 35.267857 61.607143
Publish in an open access journal 2.714932 34.389140 62.895928
In [122]:
#This question contained multiple parts, participants gave one answer to each.

#Create dictionary containing responses.

sc_activities_future={ "sc_activities_future_data_paper":"Publish a data paper or publish a dataset",
                       "sc_activities_future_register_report":"Submitting a registered report",
                       "sc_activities_future_curation":"Take advantage of a data curation or research data management service",
                       "sc_activities_future_replication":"Publish a direct replication of a previously published study",
                       "sc_activities_future_data_mediated":"Make data available, but only to researchers with appropriate credentials",
                       "sc_activities_future_share_protocol":"Share or publish a study protocol",
                       "sc_activities_future_cite_data":"Cite a dataset",
                       "sc_activities_future_preprint":"Publish a preprint",
                       "sc_activities_future_share_materials":"Share or publish other research materials",
                       "sc_activities_future_preregister":"Pre-register a study",
                       "sc_activities_future_oa_green":"Deposit an author’s accepted manuscript",
                       "sc_activities_future_cite_code":"Cite code or software",
                       "sc_activities_future_oa_gold":"Publish in an open access journal"}
                      
    
    
    
#Create dataframe containing both the number and percentage of responding participants who selected each response.
   
df3 = df[list(sc_activities_future.keys())]
df3 = df3.apply(lambda x: pd.value_counts(x, normalize=True))*100
df3 = df3.T
df3.set_index([list(sc_activities_future.values())], inplace=True)
df3
Out[122]:
I don't know I dont know No Yes
Publish a data paper or publish a dataset NaN 45.982143 27.678571 26.339286
Submitting a registered report NaN 35.426009 9.417040 55.156951
Take advantage of a data curation or research data management service NaN 53.571429 16.964286 29.464286
Publish a direct replication of a previously published study NaN 42.857143 12.500000 44.642857
Make data available, but only to researchers with appropriate credentials NaN 40.625000 25.000000 34.375000
Share or publish a study protocol NaN 33.035714 18.750000 48.214286
Cite a dataset NaN 43.303571 6.250000 50.446429
Publish a preprint 27.027027 NaN 9.459459 63.513514
Share or publish other research materials NaN 28.888889 11.555556 59.555556
Pre-register a study NaN 20.888889 5.777778 73.333333
Deposit an author’s accepted manuscript NaN 23.423423 10.360360 66.216216
Cite code or software NaN 24.200913 4.566210 71.232877
Publish in an open access journal 12.669683 NaN 3.167421 84.162896
In [123]:
#Create a figure to compare "yes" responses.


df4 = pd.DataFrame({"Current":df2["Yes"],
                    "Future":df3["Yes"]})

with sns.color_palette("tab10"):
    a_stacked = df4.plot(kind='barh', legend=False)
    
#Clean up the formatting
a_stacked.invert_yaxis()
a_stacked.set(xlim=(0, 100))
plt.xlabel("Percentage")
sns.despine(offset=10)
plt.legend(bbox_to_anchor=(1, 1), loc=2)


plt.savefig("Desktop/psych_future.png", format='png', dpi=1000, bbox_inches="tight")