Data and code from: Statistical structure and the evolution of languages
Data files
Jan 21, 2026 version files 31.07 MB
- README.md (1.38 KB)
- RSPB_replication.zip (31.07 MB)
Abstract
Human cultural development is marked by the emergence of new words and ideas, reflecting societal changes. But how does this evolution proceed? We use modern methods in natural language processing (namely, word embeddings) to measure statistical traces of cultural development, providing a testing ground to compare different models as to how this process works. We show that real embeddings of English and 21 other languages exhibit a series of previously unrecognized regularities, specifically (a) frequency assortativity, where entities of high popularity cluster near other high-popularity entities, (b) characteristic clustering velocity profiles due to aggregation into hierarchical structures, (c) persistent temporal dynamics, where newly-created entities appear disproportionately near other recent entries, and (d) Taylor’s law, implying that over time and across empirical semantic space the variance in new entity counts scales as a power of the mean, which helps systematize and quantify large historical fluctuations of neologisms. To explain these facts, we propose a class of generative models (specifically, directed preferential placement) that construct synthetic embeddings exhibiting similar regularities. We show that analogous regularities also occur in other data sets, suggesting that such generating models may shed light on new aspects of language and cultural evolution.
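Taylor's law in (d) says that the variance of new-entity counts scales as a power of their mean, variance = a * mean^b. As a minimal sketch of how such an exponent can be estimated, the example below fits a straight line to log-variance against log-mean by ordinary least squares. The function name `taylor_exponent` and the toy data are assumptions made for illustration; they are not taken from the replication code, which may estimate the exponent differently.

```python
import math


def taylor_exponent(count_groups):
    """Fit variance = a * mean**b on log-log axes and return (a, b).

    count_groups: a list of lists, each holding the counts observed in one
    region of the space (or one time window). Groups with fewer than two
    observations, zero mean, or zero variance are skipped because their
    logarithms are undefined. Hypothetical helper, for illustration only.
    """
    xs, ys = [], []
    for counts in count_groups:
        n = len(counts)
        if n < 2:
            continue
        mean = sum(counts) / n
        var = sum((c - mean) ** 2 for c in counts) / (n - 1)  # sample variance
        if mean > 0 and var > 0:
            xs.append(math.log(mean))
            ys.append(math.log(var))
    if len(xs) < 2:
        raise ValueError("need at least two usable groups")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all groups have the same mean; slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


# Toy data: the same count pattern scaled by k. Scaling counts by k scales
# the mean by k and the variance by k**2, so the fitted exponent is 2 up to
# floating-point rounding.
base = [0, 1, 2, 3, 4]
groups = [[k * c for c in base] for k in (1, 2, 4, 8)]
a, b = taylor_exponent(groups)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real data the groups would be counts of new entities per region or period, and a straight-line fit on log-log axes is only a first check of the scaling.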
Key Functions Overview
This repository contains key functions for generating and analyzing embedding models.
To reproduce Figures/Tables
To reproduce all figures and tables, open ./src/reproduce-results.ipynb and run all cells sequentially.
All figures and tables can be reproduced from the cached result files in ./data.
Core Model Generation (src/gen_models.py)
- gen_model_gaussian(): Standard Gaussian embedding generation
- gen_model_mixture_gaussian(): Mixture Gaussian model with clustering using make_blobs
- gen_model_uniform(): Uniform distribution embedding generation
- gen_model_uniform_directional(): Directional preferential placement with spherical coordinates
- gen_model_preferential_placement_v2(): Preferential placement model with exponential radius
- gen_parameterized_preferential_placement_v2(): Parameterized preferential placement with VMF sampling and multiple options
- gen_model_preferential_placement_recency(): Parameterized preferential placement with recency-based decay
- gen_model_embed(): Main interface for generating embeddings with specified model and parameters
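To give a feel for what "preferential placement with exponential radius" can mean, here is a minimal sketch under stated assumptions. It is not the code in src/gen_models.py, and the signatures of the functions listed above are not documented here. The function `sketch_preferential_placement` and its parameters `n_points`, `dim`, `scale`, and `seed` are hypothetical. The sketch assumes each new point is placed near a uniformly chosen existing point, at an exponentially distributed distance in a uniformly random direction.

```python
import math
import random


def sketch_preferential_placement(n_points, dim=2, scale=1.0, seed=0):
    """Grow a point cloud one point at a time (illustrative assumption only).

    Each new point picks an existing point uniformly at random as its parent,
    so regions that already hold many points attract more new points. The
    offset from the parent has an exponentially distributed length with mean
    `scale` and a direction drawn uniformly on the unit sphere.
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    rng = random.Random(seed)
    points = [[0.0] * dim]  # start from a single seed point at the origin
    while len(points) < n_points:
        parent = rng.choice(points)
        radius = rng.expovariate(1.0 / scale)  # exponential radius, mean = scale
        # A normalized Gaussian vector is uniformly distributed on the sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            continue  # degenerate draw; try again
        points.append([p + radius * d / norm for p, d in zip(parent, direction)])
    return points


cloud = sketch_preferential_placement(200, dim=3, scale=0.5, seed=42)
print(len(cloud), "points in", len(cloud[0]), "dimensions")
```

The listed variants (directional placement, VMF sampling, recency-based decay) presumably change how the parent and the direction are chosen; this sketch keeps both uniform for simplicity.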
Model Configuration (src/gen_models.py)
- get_model_results_key_label(): Returns model configurations and parameter combinations for embedding models
