Causal evidence for social group sizes from Wikipedia editing data
Data files
Apr 08, 2024 version, 320.67 KB
- episodeclusters.dat
- README.md
- trust.dat
- workclusters.dat
Abstract
Human communities have self-organizing properties in which specific Dunbar Numbers may be invoked to explain group attachments. By analyzing Wikipedia editing histories across a wide range of subject pages, we show that there is an emergent coherence in the size of transient groups formed to edit the content of subject texts, with two peaks averaging at around $N=8$ for the size corresponding to maximal contention, and at around $N=4$ as a regular team. These values are consistent with the observed sizes of conversational groups, as well as the hierarchical structuring of Dunbar graphs. We use the Promise Theory model of bipartite trust to derive a scaling law that fits the data and may apply to all group size distributions, when based on attraction to a seeded group process. In addition to providing further evidence that even spontaneous communities of strangers are self-organizing, the results have important implications for the governance of the Wikipedia commons and for the security of all online social platforms and associations.
README: Causal evidence for social group sizes from Wikipedia editing data
https://doi.org/10.5061/dryad.fn2z34v36
This is part of a project to formulate a practical Promise Theory model of trust for our Internet and machine-enabled age. It is not related to blockchain or so-called trustless technologies, and is not specifically based on cryptographic techniques. Rather, it addresses trustworthiness as an assessment of reliability in keeping specific promises, and trust as a tendency to monitor or oversee these processes.
The files contain data gathered by parsing the edit history logs of many Wikipedia pages. While looking for signatures of trust, we discovered evidence of ad hoc group formation among the users editing pages, consistent with the Dunbar number hypothesis.
We provide the cache of data used in our paper here, in accordance with procedure, but we encourage anyone to collect data themselves using the code referred to below or their own adaptation of it. The Wikipedia data are continuously observable.
Not all of the columns are used in the analysis.
A full description can be found at:
https://github.com/markburgess/Trustability
Description of the data and file structure
The file trust.dat has 21 space-separated columns, written by the Go statement below (the column numbers appear in the comments). Columns whose names end in `L` appear to hold logarithms of the corresponding base column, for use in the log-log plots.

```go
output := fmt.Sprintln(
	L,            // 1  text
	LL,           // 2
	N,            // 3  users
	NL,           // 4
	N2,           // 5  users-cluster
	N2L,          // 6
	I,            // 7  issues
	IL,           // 8
	w,            // 9  process work ratio (talk/article)
	wL,           // 10
	u,            // 11 mistrust sample work ratio
	uL,           // 12
	mistrust,     // 13 s/H
	mistrustL,    // 14
	TG,           // 15 av. episode duration, i.e. group interaction duration
	TU,           // 16 av. episode duration per user
	TGL,          // 17
	TUL,          // 18
	TU2,          // 19 av. episode duration per user
	TU2L,         // 20
	bot_fraction, // 21 bots/human users
)
```
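As a minimal sketch of reading this format back (assuming `trust.dat` is in the working directory; the variable names and column choices here are ours, for illustration), the user and issue counts can be extracted like this:

```go
// Sketch: parse the 21 space-separated columns of trust.dat and print
// the (users, issues) pairs that the gnuplot command below plots.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("trust.dat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		cols := strings.Fields(scanner.Text())
		if len(cols) < 21 {
			continue // skip malformed rows
		}
		users, _ := strconv.ParseFloat(cols[2], 64)  // column 3: users N
		issues, _ := strconv.ParseFloat(cols[6], 64) // column 7: issues I
		fmt.Println(users, issues)
	}
}
```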
Graphs are generated from these data as described in the gnuplot input file
https://github.com/markburgess/Trustability/blob/main/src/gnuplot.in

For example, the contention intensity for mistrust signals (issues, column 7, plotted against users, column 3) is generated by:

```
plot [0:15] "trust.dat" using 3:7
```
The workclusters.dat and episodeclusters.dat files hold frequency histograms of representative group sizes during editing "episodes". Each row contains the group size n, its relative frequency h/n_tot, and the logarithms of both, written as:
```go
h := float64(histogram[n]) // frequency count for group size n
s := fmt.Sprintf("%f %f %f %f\n", float64(n), h/n_tot, math.Log(float64(n)), math.Log(h/n_tot))
```
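As an illustration (a sketch assuming the four-column layout just shown; the file name and the choice to report the mode are ours), the peak of a cluster histogram can be read off like this:

```go
// Sketch: find the modal group size in a cluster histogram file whose rows
// are: n, h/n_tot, log(n), log(h/n_tot), as written by the snippet above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("workclusters.dat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var modalSize, maxFreq float64
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		cols := strings.Fields(scanner.Text())
		if len(cols) < 2 {
			continue
		}
		n, _ := strconv.ParseFloat(cols[0], 64)    // group size n
		freq, _ := strconv.ParseFloat(cols[1], 64) // relative frequency h/n_tot
		if freq > maxFreq {
			modalSize, maxFreq = n, freq
		}
	}
	fmt.Printf("modal group size %.0f (relative frequency %f)\n", modalSize, maxFreq)
}
```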
Again, we refer to the extensive notes at https://github.com/markburgess/Trustability
Sharing/Access information
Data were derived from Wikipedia's publicly observable page histories and can be re-collected at any time using the code linked below.
Code/Software
Additional aggregation and graph-generation code can be found here:
https://github.com/markburgess/Trustability/tree/main/data/GeneratePlots
Methods
Data sets are collected by direct scanning of Wikipedia's open platform data. The data have been processed by code described at https://github.com/markburgess/Trustability and documented in detail at http://markburgess.org/trustproject.html
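For orientation only, here is a minimal sketch of fetching one page's revision history from the public MediaWiki API in Go. The page title is illustrative, and the project's own scanner in the repository above is the authoritative implementation:

```go
// Sketch: list recent revisions (user, timestamp) of a single Wikipedia
// page via the MediaWiki API. Error handling is kept minimal.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type apiResponse struct {
	Query struct {
		Pages []struct {
			Title     string `json:"title"`
			Revisions []struct {
				User      string `json:"user"`
				Timestamp string `json:"timestamp"`
			} `json:"revisions"`
		} `json:"pages"`
	} `json:"query"`
}

func main() {
	q := url.Values{
		"action":        {"query"},
		"prop":          {"revisions"},
		"titles":        {"Promise_theory"}, // illustrative page title
		"rvprop":        {"user|timestamp"},
		"rvlimit":       {"50"},
		"format":        {"json"},
		"formatversion": {"2"}, // pages returned as a JSON array
	}
	resp, err := http.Get("https://en.wikipedia.org/w/api.php?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var r apiResponse
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		panic(err)
	}
	for _, p := range r.Query.Pages {
		for _, rev := range p.Revisions {
			fmt.Println(p.Title, rev.Timestamp, rev.User)
		}
	}
}
```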