Data from: Who is using AI to code? Global diffusion and impact of generative AI
Data files
Jan 21, 2026 version (3.12 GB total):
- final_data_2.zip (3.12 GB)
- README.md (12.13 KB)
Abstract
Generative coding tools promise big productivity gains, but uneven uptake could widen skill and income gaps. We train a neural classifier to spot AI-generated Python functions in over 30 million GitHub commits by 160,000 developers, tracking how fast, and where, these tools take hold. Today, AI writes an estimated 29% of Python functions in the US, a modest and shrinking lead over other countries. We estimate that AI adoption has increased quarterly output, measured in online code contributions, by 3.6%. Our evidence suggests that programmers using AI may also expand more readily into new domains of software development. However, experienced, senior-level programmers capture nearly all productivity and exploration gains, while we observe no significant benefit of AI adoption for early-career programmers. This widens, rather than closes, the skill gap.
Dataset DOI: 10.5061/dryad.3r2280gv0
Description of the data and file structure
Data for replication of the paper "Who is using AI to code? Global diffusion and impact of generative AI."
Files and variables
File: final_data_2.zip
final_data directory
Reference for the inputs used by the notebooks and scripts.
All Excel workbooks have been flattened to per-sheet CSVs in clean_csv/, preserving values while removing formatting.
Top-level files
- AI_010_data_uq.parquet — User-quarter panel used for cross-checks in gender/experience analyses.
- country_functions.csv — Function-level AI share data with country codes and years (non-US sample).
- full_countries.csv — Country-level adoption estimates with corrected shares and standard errors by year.
- functions.csv — Function-level AI share data (US sample); used in diffusion/ttest and panel scripts.
- hist_data.npz — Pre-binned histogram counts/edges for classifier score plots (Supplementary S2).
- panel_with_libs_coarse.csv — User-project-quarter commit counts and coarse library metrics for panel construction.
- project_by_project_library_bipartite_rca_sbm.csv — Project clustering output for RCA/SBM communities.
- project_by_task_from_library_uni_network.csv — Project-to-task community mapping used for PMI/Louvain communities.
- pyfunctions_ai_classified.parquet — Labeled Python function dataset with AI vs human predictions and metadata.
- raw_data_encrypted_final.csv.zip — Core raw dataset of user-level AI shares and demographics.
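The tabular files above can be read with standard tools (e.g., pandas.read_csv for the CSVs, pandas.read_parquet for the parquet files). As a dependency-free sketch, the snippet below parses a hypothetical fragment with the documented functions.csv columns and computes a mean AI share; the sample rows and values are invented for illustration only.

```python
import csv
import io

# Hypothetical sample mimicking the documented functions.csv columns;
# the real file on disk is read the same way (or with pandas.read_csv).
sample = io.StringIO(
    "user_hashed,function_modified_names,ai_share,date\n"
    "a1b2,parse_config,0.8,2023-04-01\n"
    "a1b2,load_data,0.2,2023-04-02\n"
    "c3d4,train_model,0.5,2023-05-10\n"
)

rows = list(csv.DictReader(sample))
# ai_share is documented as a fraction in 0-1
mean_ai_share = sum(float(r["ai_share"]) for r in rows) / len(rows)
print(round(mean_ai_share, 3))  # 0.5
```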
Model evaluation outputs
- newmodels_data/ — Pickled score distributions for newer models: claude-sonnet-4-20250514_42_shuffled_{988,997}.pkl, deepseek-V3_{42,1414}_{988,997}.pkl, gpt-4.1_1414_shuffled_{988,997}.pkl, o3_42_{988,997}.pkl.
- wild_data/ — Pickled WildChat evaluation scores: gpt-3.5-turbo_{assist,synth}.pkl, gpt-4_{assist,synth}.pkl.
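The internal structure of these pickle files is not specified here. Assuming each holds a flat sequence of classifier scores in [0, 1], a loading sketch might look like the following; the file contents are simulated in memory rather than read from disk.

```python
import pickle

# Assumed structure: each .pkl holds a sequence of classifier scores in [0, 1].
# Simulate one file in memory; on disk you would use open(path, "rb").
scores = [0.91, 0.15, 0.78, 0.02, 0.66]
blob = pickle.dumps(scores)

loaded = pickle.loads(blob)
# Fraction of functions flagged as AI-generated at a 0.5 decision threshold
ai_fraction = sum(s >= 0.5 for s in loaded) / len(loaded)
print(ai_fraction)  # 0.6
```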
Library network data (Supplementary S7)
- s7_datasets/library_community_network_20_compact.csv — Cleaned library co-occurrence network with communities and descriptions.
- s7_datasets/temp_lib_co/lib_matrix_124.pkl — Library-by-library co-occurrence matrix.
- s7_datasets/temp_lib_co/lib_std_124.pkl — Library index (ordered standard library list) aligned with the matrix.
- s7_datasets/temp_lib_co/node_color_dict_124.pkl — Community color mapping for visualization.
- s7_datasets/temp_lib_co/node_community_dict_124.pkl — Library-to-community assignments.
- s7_datasets/temp_lib_co/project_file_libraries_end.csv — Project-level library listings used to build the co-occurrence inputs.
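A minimal sketch of how lib_matrix_124.pkl and lib_std_124.pkl might be used together, assuming the index lists libraries in matrix order. The library names, counts, and plain-list matrix below are invented stand-ins (the real matrix object may be a NumPy array, which supports the same indexing).

```python
# Hypothetical stand-ins for lib_std_124.pkl (ordered library index) and
# lib_matrix_124.pkl (symmetric co-occurrence counts).
libraries = ["numpy", "pandas", "requests"]
matrix = [
    [0, 12, 3],
    [12, 0, 5],
    [3, 5, 0],
]

# Map each library name to its row/column position in the matrix
idx = {lib: i for i, lib in enumerate(libraries)}

def cooccurrence(a: str, b: str) -> int:
    """Look up how often two libraries appear in the same project."""
    return matrix[idx[a]][idx[b]]

print(cooccurrence("numpy", "pandas"))  # 12
```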
Salary, task, and occupation data (Supplementary S8/S17)
- s8_salary_data/Task Statements.xlsx, Task Ratings.xlsx, Scales Reference.xlsx — O*NET task text and rating scales.
- s8_salary_data/national_M2024_dl.xlsx, national_M2024_dl_pure_data.xlsx — BLS national wage and employment tables.
- s8_salary_data/ipums_*.csv, ipums_*.pkl, ipums_occupation_soc.xlsx, usa_00003.dat.gz, usa_00003.xml — IPUMS microdata extracts and crosswalks to SOC codes.
- s8_salary_data/job_task_programming_score*.pkl — LLM-assigned programming-intensity scores (raw, adjusted, percent versions).
- s8_salary_data/job_ftvector_list.pkl, job_tasklist_ft.pkl, job_tasktype_list.pkl — Task frequency/time vectors and task lists per SOC occupation.
- s8_salary_data/job_*programming_hour_*.pkl, job_im_weighted_time.pkl, job_imrank_weighted_time.pkl — Derived working-time weights and programming-hour aggregates (BLS and IPUMS variants).
- s8_salary_data/annual_salary_bls*.pkl, hour_salary_bls*.pkl, employment_count_bls*.pkl — Hourly/annual wage and employment counts with adjusted variants.
- s8_salary_data/soc_working_time_vector_example_*.pkl — Example working-time vectors under different hour assumptions.
- s8_salary_data/random_salary_dict_10000.pkl — Monte Carlo salary draws for uncertainty analysis.
- s8_salary_data/task_content.pkl, task_score_percent.csv — Task text and scoring summaries.
- s8_salary_data/correlation.xlsx, working_hour_per_year.csv, job_frequency_time.csv — Supporting correlation tables and working-hour assumptions.
- s8_salary_data/db_29_3_excel/ — Full O*NET reference workbooks (Skills, Abilities, Knowledge, Work Activities/Context, Tasks, Technology Skills, etc.) plus accompanying metadata and crosswalk tables.
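As an illustration of how Monte Carlo draws such as random_salary_dict_10000.pkl can feed an uncertainty analysis, the sketch below summarizes synthetic per-occupation salary draws with a mean and a 95% percentile interval. The SOC key, distribution parameters, and dict-of-lists layout are invented for illustration; the real file's structure may differ.

```python
import random
import statistics

# Stand-in for random_salary_dict_10000.pkl: per-occupation Monte Carlo
# salary draws, keyed by a (hypothetical) SOC code.
random.seed(0)
salary_draws = {"15-1252": [random.gauss(110_000, 15_000) for _ in range(10_000)]}

draws = sorted(salary_draws["15-1252"])
mean = statistics.mean(draws)
# Empirical 2.5th and 97.5th percentiles bound a 95% interval
lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(f"mean={mean:.0f}, 95% interval=({lo:.0f}, {hi:.0f})")
```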
Column names and units (tabular files)
- country_functions.csv — Columns: Unnamed: 0 (row index), country (country code), ai_share (fraction 0–1 of functions with AI), function_hash (hashed function id), year (YYYY).
- full_countries.csv — Columns: country (code), year (YYYY), std_error (standard error on share), country_probability (country-level adoption share, 0–1).
- functions.csv — Columns: user_hashed (hashed user id), function_modified_names (function name), ai_share (fraction 0–1), date (YYYY-MM-DD).
- panel_with_libs_coarse.csv — User-project-quarter panel: user_hashed, project, year_quarter (e.g., 2023Q1), year, quarter, commit counts (n_commits, n_commits_multi_files, n_commits_with_import), AI share (ai_share_window, 0–1), project sizes (project_n_people, project_n_commits, project_n_files), import uniqueness counts (unique_*, new_unique_*, globally_new_* across lists/entries/communities/pairs/combos at coarse/fine levels), community assignments/descriptions (community_pmi_louv_{coarse,fine}, description_pmi_louv_{coarse,fine}), and library diversity metrics (unique_topk_libs, etc.). All unique*/new*/globally_new* fields are integer counts.
- project_by_project_library_bipartite_rca_sbm.csv — Community metadata: Unnamed: 0 (row index), community (categorical id), node_count, nodes (list), description_simple, description_detail, project_count, projects (list).
- project_by_task_from_library_uni_network.csv — Project-to-task community mapping: Unnamed: 0.1, Unnamed: 0 (row indices), community (categorical id), library_count, libraries (list), description_simple, description_detail, projects (list), project_count.
- AI_010_data_uq.parquet — User-quarter panel with commit/activity counts (Cuq, Cuq_mfiles, Cuq_wimprt and log transforms log_*), library diversity counts (libC_all, libE_all, libSE_all, libLE_all, libQC_all, libkC_all, libkE_all, libSC_all, libLC_all, libkQC_all, libSQ_all, libLQ_all plus _new_u/_new_W variants), community labels (Cuq_louv_*, Cuq_sbm_*), AI use metrics (AIjoh_win, AIav, AIma{4,8,16,32}_i and lagged L* variants), demographics (gender categorical, experience continuous), time indices (q, t), and identifiers (IDu, IDu_hash). Shares are fractions 0–1; counts are integers.
- pyfunctions_ai_classified.parquet — Function-level labels: modified_blocks (code text), user_experience (experience level), true_label (categorical), prediction (model prediction).
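A small sketch of working with this schema: parsing a year_quarter value like 2023Q1 into its components and checking that a share column lies in the documented 0–1 range. The row values below are hypothetical.

```python
import re

def parse_year_quarter(yq: str) -> tuple[int, int]:
    """Split a year_quarter value like '2023Q1' into (year, quarter)."""
    m = re.fullmatch(r"(\d{4})Q([1-4])", yq)
    if m is None:
        raise ValueError(f"bad year_quarter: {yq!r}")
    return int(m.group(1)), int(m.group(2))

# Hypothetical row following the documented panel_with_libs_coarse.csv schema
row = {"user_hashed": "a1b2", "year_quarter": "2023Q1", "ai_share_window": "0.31"}

year, quarter = parse_year_quarter(row["year_quarter"])
share = float(row["ai_share_window"])
assert 0.0 <= share <= 1.0  # shares are documented as fractions 0-1
print(year, quarter)  # 2023 1
```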
Code/software
All analysis notebooks and scripts are written in Python.
Access information
All data derived from GitHub are based on publicly available repositories accessed via GitHub’s public infrastructure and APIs; no source code or copyrighted repository contents are redistributed, and only derived, aggregated, or model-generated information is stored.
Estimates of wages, employment, and time allocation rely on publicly available occupational and labor statistics—primarily O*NET task surveys, BLS employment and wage data, and the American Community Survey—combined with standard assumptions and model-based inference where quantities are not directly observed.
