Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
Data files
Feb 28, 2020 version; files total 30.07 MB
- 1_get_tagging.py (2.05 KB)
- 2_get_heaps.py (3.43 KB)
- 3_get_N-V_tag.py (964 B)
- 4_get_macro.py (1.09 KB)
- 5_get_macro_excess.py (1.02 KB)
- 6_get_excess_curves.py (945 B)
- aus01.txt (697.82 KB)
- aus02.txt (899.29 KB)
- aus03.txt (686.33 KB)
- aus04.txt (440.78 KB)
- aus05.txt (475.20 KB)
- aus06.txt (898.66 KB)
- aus07.txt (129.82 KB)
- dic01.txt (910.91 KB)
- dic02.txt (162.07 KB)
- dic03.txt (179.65 KB)
- dic04.txt (191.14 KB)
- dic05.txt (587.28 KB)
- dic06.txt (771.32 KB)
- dic07.txt (1.01 MB)
- dic08.txt (541.24 KB)
- dic09.txt (1.97 MB)
- dic10.txt (1.77 MB)
- dic11.txt (1.91 MB)
- dic12.txt (1.45 MB)
- dic13.txt (172.85 KB)
- example.txt (82.74 KB)
- get_sn_v0.out (23.74 KB)
- get_std_v0.out (23.78 KB)
- hux01.txt (82.74 KB)
- hux02.txt (523.32 KB)
- hux03.txt (340.42 KB)
- hux04.txt (119.54 KB)
- hux05.txt (724.25 KB)
- hux06.txt (389.37 KB)
- hux07.txt (900.44 KB)
- hux08.txt (755.97 KB)
- hux09.txt (672.90 KB)
- hux10.txt (78.25 KB)
- hux11.txt (19.70 KB)
- hux12.txt (13.67 KB)
- hux13.txt (9.96 KB)
- hux14.txt (24.43 KB)
- hux15.txt (63.12 KB)
- poe01.txt (41.33 KB)
- poe02.txt (32.50 KB)
- poe03.txt (38.79 KB)
- poe04.txt (16.20 KB)
- poe05.txt (22.55 KB)
- poe06.txt (20.91 KB)
- poe07.txt (22.12 KB)
- poe08.txt (42.80 KB)
- poe09.txt (7.42 KB)
- poe10.txt (13.95 KB)
- poe11.txt (13.25 KB)
- poe12.txt (14.13 KB)
- poe13.txt (11.39 KB)
- poe14.txt (26.46 KB)
- poe15.txt (34.62 KB)
- twa01.txt (903.61 KB)
- twa02.txt (396.98 KB)
- twa03.txt (656.56 KB)
- twa04.txt (363.46 KB)
- twa05.txt (289.99 KB)
- twa06.txt (831.01 KB)
- twa07.txt (96.04 KB)
- twa08.txt (200.33 KB)
- twa09.txt (4.36 KB)
- twa10.txt (6.79 KB)
- twa11.txt (14.96 KB)
- twa12.txt (396.07 KB)
- twa13.txt (578.68 KB)
- twa14.txt (176.40 KB)
- twa15.txt (123.09 KB)
- wel01.txt (179.86 KB)
- wel02.txt (246.47 KB)
- wel03.txt (222.33 KB)
- wel04.txt (324.79 KB)
- wel05.txt (278.23 KB)
- wel06.txt (337.97 KB)
- wel07.txt (386.74 KB)
- wel08.txt (580.39 KB)
- wel09.txt (950.99 KB)
- wel10.txt (454.86 KB)
Abstract
We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or "tags," namely, nouns, verbs, and others), and analyze the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' law, a feature that is still in need of extensive assessment.
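The comparison between a text and its random shufflings can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`heaps_function`, `shuffled_average`, `excess`), the number of shufflings, and the sign convention of the excess are assumptions made here for illustration, and tokenization and tagging are left out.

```python
import random

def heaps_function(tokens):
    """Return V(n) for n = 1..N: distinct tokens among the first n."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

def shuffled_average(tokens, n_shuffles=100, seed=0):
    """Average the Heaps function over random shufflings of the text.

    n_shuffles and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    work = list(tokens)
    total = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_function(work)):
            total[i] += v
    return [t / n_shuffles for t in total]

def excess(tokens, n_shuffles=100, seed=0):
    """Real curve minus shuffled average (sign convention assumed here)."""
    real = heaps_function(tokens)
    avg = shuffled_average(tokens, n_shuffles, seed)
    return [r - a for r, a in zip(real, avg)]

# Toy text in which the new words "b" and "c" appear only at the end.
toy = "a a a a b c".split()
print(heaps_function(toy))
print(excess(toy))
```

With this sign convention, a negative excess means that new words appear later in the real text than in its shufflings, which is the "retarded" behavior described above. Both curves necessarily coincide at the first and last positions, since every ordering starts with one distinct word and ends with the full vocabulary.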
The original versions of the corpus files were obtained from public-domain sources (www.gutenberg.org, www.fadedpage.com) and processed to remove spurious text not belonging to the original works. The computational codes were written by the authors.
This file collection is a complete version of the analyzed corpus and of the codes used to perform the analysis.
This dataset contains (1) plain-text files with the texts of 75 literary works in English, used in the analysis of Heaps' law both across the corpus and within each work, as explained in the manuscript RSOS-200008 submitted to Royal Society Open Science, and (2) six Python scripts that perform the Heaps analysis of the texts, plus three auxiliary files used by the scripts.
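As an illustration of the corpus-level analysis, the sketch below fits a power law V = K N^β to pairs of text length N and vocabulary size V by least squares on the logarithms. This is a common way to estimate a Heaps exponent, but it is an assumption here and not necessarily the procedure implemented in the scripts of this dataset; the function name `fit_heaps_law` and the numbers in the example are hypothetical and synthetic.

```python
import math

def fit_heaps_law(lengths, vocab_sizes):
    """Fit V = K * N**beta by ordinary least squares in log-log space.

    lengths: text lengths N (in tokens), one per work.
    vocab_sizes: vocabulary sizes V (distinct words), one per work.
    Returns the pair (K, beta).
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(v) for v in vocab_sizes]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Synthetic data generated from an exact power law (not corpus values).
lengths = [1_000, 10_000, 100_000]
vocab_sizes = [5.0 * n ** 0.5 for n in lengths]
K, beta = fit_heaps_law(lengths, vocab_sizes)
print(K, beta)
```

In practice each (N, V) pair would come from one work, for example the length of a `.txt` file in tokens and the final value of its vocabulary count. How words are tokenized and normalized affects both numbers and is not specified by this sketch.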
- Chacoma, A.; Zanette, D. H. (2020). Heaps’ Law and Heaps functions in tagged texts: evidences of their linguistic relevance. Royal Society Open Science. https://doi.org/10.1098/rsos.200008
