Computational phylogenetics reveal the history of sign languages
Data files
Feb 05, 2024 version files 895.55 KB
- README.md
- SignLanguagesPhylogeny-main.zip
Abstract
Sign languages are naturally occurring languages. As such, their emergence and spread reflect the histories of their communities. However, limitations in historical recordkeeping and linguistic documentation have hindered diachronic analysis of sign languages. Here, we use computational phylogenetic methods to study family structure among 19 sign languages from deaf communities worldwide. We use phonologically coded lexical data from contemporary languages to infer relatedness, and suggest these methods can help study regular form changes in sign languages. The inferred trees are consistent in key respects with known historical information, but challenge certain assumed groupings and surpass analyses made available by traditional methods. Moreover, the phylogenetic inferences are not reducible to geographic distribution, but do affirm the importance of geopolitical forces in the histories of human languages.
README: Phylogenies with Matricial Datasets
This is the implementation of the numerical methods described in the paper "Computational phylogenetics reveal histories of sign languages". Instructions are given below to reproduce all the results from the paper, and to re-use the methodology on other data sets.
The code runs on R version 4.1.2 and uses the "parallel" library.
All files and code are compressed in the ZIP (including a copy of the README); files are also available at https://github.com/GClarte/SignLanguagesPhylogeny.
functions
This folder contains all the core functions needed to run the analysis. Users wishing to reproduce the results from the paper, or to make small changes (parameter values, different data) will not need to open these files. These files may be of use to researchers wishing to make more substantial changes to the model. In case of need, you can contact Grégoire Clarté.
datasets
This folder contains the datasets used for the experiments presented in the paper. The data consist of a CSV file in which each line corresponds to a word in a language. The first columns give the meaning and the language; the remaining columns give the characters used to encode the word. We also include an Excel file containing exactly the same data.
Details of the coding system can be found in the supplementary materials of the associated paper. Cell values of "UNDEF" or "NA" indicate that the coding field was not applicable to the sign. For example, a sign that is produced in neutral signing space does not receive a code for body part.
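As a minimal sketch of how a file in this layout might be loaded in R, the snippet below reads a small inline table and treats both "UNDEF" and "NA" as missing values. The column names and sign entries are invented for illustration; the real file in the datasets folder has its own headers and codes.

```r
# Toy table in the layout described above: meaning, language, then coded characters.
# All column names and values here are hypothetical placeholders.
csv_text <- "meaning,language,handshape,location,body_part
MOTHER,LangA,B,body,chin
MOTHER,LangB,5,body,chin
WATER,LangA,W,neutral,UNDEF
WATER,LangB,W,neutral,NA"

# Read the table, mapping both "UNDEF" and "NA" to R's missing value.
signs <- read.csv(text = csv_text,
                  na.strings = c("UNDEF", "NA"),
                  stringsAsFactors = FALSE)

# Count, per coding field, how many signs have no applicable value.
missing_per_field <- colSums(is.na(signs))
print(missing_per_field)
```

To read an actual file, replace the `text = csv_text` argument with the path to the CSV.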
Results
This folder contains the resulting samples of trees (EBNZ_1.nex, EBNZ_2.nex and Asia.nex) and the resulting consensus trees (EBNZ_1_annote.nex, EBNZ_2_annote.nex and Asia_annote.nex). We did not include the full output of the SMC (sequential Monte Carlo) sampler because the files are too large; they can be reproduced using the code below.
Reproducing results from the paper
R scripts are available at the root of the folder. They correspond to different analyses:
- AsianSL.R corresponds to the code for the study of the Asian dataset
- EuropeanSL.R corresponds to the code for the study of the European sign languages (with New Zealand sign language).
- AllSL.R includes Asian, European and New Zealand sign languages. (This last analysis gives trees which should not be reused, as it is based upon the unwarranted assumption that all sign languages belong to a single tree.)
To reproduce the results of the paper, execute the relevant R script. We recommend using a large cluster as the running time is about a day with 40 cores; the number of cores can be changed in the last lines of each script.
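For readers unfamiliar with the "parallel" library, the following sketch shows the general way a core count is passed to a parallel map in R. It is not taken from the paper's scripts: the variable `n_cores` and the function `slow_square` are hypothetical, and the scripts define their own setting near their last lines.

```r
library(parallel)

# Hypothetical variable name; the paper's scripts define their own setting.
# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Stand-in for an expensive, independent unit of work.
slow_square <- function(i) {
  i^2
}

# Distribute the work over n_cores workers and collect the results in order.
results <- unlist(mclapply(1:8, slow_square, mc.cores = n_cores))
print(results)
```

On a cluster, `parallel::detectCores()` reports how many cores the current machine exposes, which can help in choosing a sensible value.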
In all cases, you should then execute the PostProcessing.R script, which post-processes the output and saves the trees in the Nexus format for interpretation in standard phylogenetic software.
Users wishing to make slight modifications to the analyses will most likely want to change the following parameter values:
- prior information on ages is set in the object "Contraintesages"
- the set of characters used is given by object "qui"
- the set of languages included is set by object "quelleslangues"
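The exact structure of these objects is defined in the scripts and is not reproduced here. Purely as an illustration of the kind of selection they control, the sketch below subsets a toy table by language and by character column. It borrows the names `quelleslangues` and `qui` for readability, but treating them as character vectors of names is an assumption, as are all column names and values.

```r
# Toy table; languages, column names and codes are hypothetical.
signs <- data.frame(
  meaning   = c("MOTHER", "MOTHER", "MOTHER", "WATER", "WATER", "WATER"),
  language  = c("LangA", "LangB", "LangC", "LangA", "LangB", "LangC"),
  handshape = c("B", "5", "B", "W", "W", "3"),
  location  = c("chin", "chin", "cheek", "neutral", "neutral", "chin"),
  movement  = c("tap", "tap", "tap", "shake", "none", "tap"),
  stringsAsFactors = FALSE
)

# Assumed representation: character vectors naming what to keep.
quelleslangues <- c("LangA", "LangC")         # languages to include
qui            <- c("handshape", "location")  # characters to use

# Keep only the chosen languages and the chosen character columns.
subset_signs <- signs[signs$language %in% quelleslangues,
                      c("meaning", "language", qui)]
print(subset_signs)
```

Before editing the real objects, check in the relevant script how they are actually indexed (by name or by position).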
Using the code on other data sets
The script GenericTemplate.R gives a template which can be adapted to apply the method to other data. To use this script, re-users will need to fill out certain parameters, which are marked explicitly. This script goes through the whole process: formatting the dataset for the inference, setting of the parameters, launching of the SMC, description of the output.
Here too, you should then execute the PostProcessing.R script. It plots all the parameters and produces the .nex files needed for subsequent phylogenetic analyses. The resulting .nex files can be fed to standard phylogenetic software.
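One possible way to inspect a Nexus tree file from within R is the `ape` package; this is a suggestion, not a tool named by the authors. The sketch below writes two toy trees to a temporary Nexus file, standing in for a file produced by PostProcessing.R, and reads them back. The tree shapes and tip labels are invented.

```r
# Requires the 'ape' package (an assumption; the README does not name a reader).
if (requireNamespace("ape", quietly = TRUE)) {
  # Build two toy trees and write them to a temporary Nexus file,
  # standing in for a sample of trees saved by the post-processing step.
  tree_1 <- ape::read.tree(text = "((A,B),(C,D));")
  tree_2 <- ape::read.tree(text = "((A,C),(B,D));")
  nexus_path <- tempfile(fileext = ".nex")
  ape::write.nexus(tree_1, tree_2, file = nexus_path)

  # Read the sample back and look at its size and tip labels.
  sampled <- ape::read.nexus(nexus_path)
  print(length(sampled))
  print(sampled[[1]]$tip.label)
}
```

With a real output file, pass its path to `ape::read.nexus()` directly; individual trees can then be drawn with `plot()`.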
Thanks
The implementation of the Dirichlet distribution comes from the gtools package.
Methods
The data set comprises coded vocabulary data from 19 sign languages. Vocabulary items were sourced from freely available online sign language dictionaries and were annotated using a web-based interface developed for the project. The categories and category values used in the coding system are compatible with and informed by leading contemporary theories of sign language phonology. Additional information about data collection and coding is available in Section 2 and Section 4 of the supplementary materials text.
Usage notes
The data can be opened with any proprietary or open-source software capable of working with .csv files, including Microsoft Excel and Google Sheets.