The Generalized Euclidean Distance (GED) has been extensively used to conduct morphological disparity analyses based on palaeontological matrices of discrete characters. This is in part because some implementations allow the use of morphological matrices with high percentages of missing data without needing to prune taxa before ordinating the data set. Previous studies have suggested that this way of using the GED may bias the resulting morphospace, but a detailed study of this possible effect was still lacking. Here, we test whether the percentage of missing data for a taxon artificially influences its position in the morphospace, and whether missing data affect pre- and post-ordination disparity measures. We find that this use of the GED creates a systematic bias, whereby taxa with higher percentages of missing data are placed closer to the centre of the morphospace than those with more complete scorings. This bias extends into pre- and post-ordination calculations of disparity measures and can lead to erroneous interpretations of disparity patterns, especially if the specimens present in a particular time interval or clade have distinct proportions of missing information. We suggest that this implementation of the GED should be used with caution, especially in cases with high percentages of missing data. Results recovered using an alternative distance measure, the Maximum Observed Rescaled Distance (MORD), are more robust to missing data. As a consequence, we suggest that MORD is a more appropriate distance measure than GED when analysing data sets with high proportions of missing data.
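To make the contrast between the two measures concrete, the following is a minimal Python sketch of one simplified reading of each distance. It is not the formulae given in Supporting Information 1 nor the deposited scripts. It assumes unordered, equally weighted characters (so the maximum possible difference per character is 1), missing entries coded as `None`, and a GED variant that replaces each incalculable character difference with the mean difference over the comparable characters. The function names `pairwise_differences`, `mord`, and `ged` are hypothetical.

```python
import math

MISSING = None  # assumption: missing entries are coded as None


def pairwise_differences(a, b):
    """0/1 differences for the characters scored in both taxa."""
    return [0.0 if x == y else 1.0
            for x, y in zip(a, b)
            if x is not MISSING and y is not MISSING]


def mord(a, b):
    """Simplified MORD: observed differences divided by the maximum
    possible differences over the comparable characters (here, their count)."""
    diffs = pairwise_differences(a, b)
    if not diffs:
        return float("nan")  # no comparable characters
    return sum(diffs) / len(diffs)


def ged(a, b):
    """Simplified GED: Euclidean distance in which every incalculable
    character difference is replaced by the mean comparable difference."""
    diffs = pairwise_differences(a, b)
    if not diffs:
        return float("nan")  # no comparable characters
    mean_diff = sum(diffs) / len(diffs)
    n_incalculable = len(a) - len(diffs)
    return math.sqrt(sum(d * d for d in diffs) + n_incalculable * mean_diff ** 2)


# Toy example: the same taxon scored completely and with two missing entries.
reference = [0, 1, 1, 0]
complete = [1, 1, 0, 0]
incomplete = [1, MISSING, MISSING, 0]

print(mord(reference, complete), mord(reference, incomplete))
print(ged(reference, complete), ged(reference, incomplete))
```

In this toy case the simplified MORD is the same for the complete and the incomplete scoring, whereas the simplified GED is smaller for the incomplete one, because each filled-in mean difference contributes less after squaring than an observed difference of 1. This illustrates the direction of the effect only under the stated assumptions.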

Supporting Information 1

Formulae used to calculate the distances and disparity measures.

Supporting Information 2

References for the matrices employed.

Supporting Information 3

Characteristics of the matrices used in this study, along with information regarding the particular analyses in which each was used.

Supporting Information 4

Histograms for the 126 matrices with more than 5% of missing data showing the distribution of the percentage of missing entries per taxon. The vertical dashed line indicates the mean percentage of missing entries.

Supporting Information 5

Histograms showing the percentage of matrices with significant and negative (black), non-significant (light grey), and significant and positive (dark grey) Spearman’s correlations for the GED and the MORD, with α = 0.05. The distances are calculated based on the first 3 PCos and all the PCos. These are the results without any kind of correction for negative eigenvalues. Scatter plots for the 126 matrices with more than 5% of missing data showing the Euclidean distance to the centroid for each taxon against its percentage of missing entries. The distance to the centroid is scaled to 1 independently for each distance measure in order to fit them in the same plot. The red dots correspond to the calculation with GED, while the blue dots refer to those made with MORD. A simple regression line is fitted through each group in order to illustrate the tendency of the data. The number of taxa indicated between brackets corresponds to the untrimmed matrices. The subtitle indicates whether the Spearman’s correlation between the plotted variables is significant and negative (SN), non-significant (NS), or significant and positive (SP) at α = 0.05.
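The quantity plotted here, the rank correlation between each taxon's distance to the centroid and its percentage of missing entries, can be sketched as follows. This is an illustrative Python sketch, not the deposited scripts. It assumes the ordination coordinates (PCo scores) are already available as a list of tuples, it computes only the Spearman coefficient (the significance test at α = 0.05 is omitted), and the function names are hypothetical.

```python
import math


def percent_missing(row, missing=None):
    """Percentage of entries in one taxon's row that are missing."""
    return 100.0 * sum(1 for x in row if x is missing) / len(row)


def centroid_distances(coords):
    """Euclidean distance of each taxon to the centroid of all taxa."""
    n, dims = len(coords), len(coords[0])
    centroid = [sum(p[k] for p in coords) / n for k in range(dims)]
    return [math.dist(p, centroid) for p in coords]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up coordinates and missing-data percentages for four taxa.
coords = [(-3.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (4.0, 0.0)]
missing_pct = [10.0, 40.0, 60.0, 5.0]
rho = spearman_rho(centroid_distances(coords), missing_pct)
print(rho)
```

A negative coefficient means that taxa with more missing entries sit closer to the centroid, which corresponds to the "significant and negative" category when the associated test is significant.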

Supporting Information 6

Scatter plots for the 126 matrices with more than 5% of missing data showing the Spearman’s correlation coefficient between the Euclidean distance to the centroid for each taxon and its percentage of missing entries against the percentage of explained variance of the PCos used to calculate the distance to the centroid.

Supporting Information 7

Histograms for the 33 matrices with less than 5% of missing data.

Supporting Information 8

Results from the 33 morphological matrices for the studied disparity measures against the proportion of randomly-distributed missing entries added to morphological matrices, calculated from distance matrices generated from the GED and the MORD.
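The degradation step described here, adding a given proportion of randomly distributed missing entries to a matrix, could be sketched as below. This is a hedged Python illustration, not the authors' scripts. It assumes the matrix is a list of equal-length rows, that missing entries are coded as `None`, and that the proportion refers to the total number of cells. The name `add_missing` and its `seed` parameter are hypothetical.

```python
import random


def add_missing(matrix, proportion, seed=0, missing=None):
    """Return a copy of `matrix` in which a given proportion of all cells,
    chosen uniformly at random among the scored cells, is set to missing."""
    rng = random.Random(seed)  # seeded so that replicates are reproducible
    n_rows, n_cols = len(matrix), len(matrix[0])
    scored = [(i, j) for i in range(n_rows) for j in range(n_cols)
              if matrix[i][j] is not missing]
    # Never try to remove more cells than are actually scored.
    n_to_remove = min(round(proportion * n_rows * n_cols), len(scored))
    result = [list(row) for row in matrix]  # leave the input untouched
    for i, j in rng.sample(scored, n_to_remove):
        result[i][j] = missing
    return result


# Toy 4 x 5 matrix with complete scorings, degraded to 25% missing entries.
full = [[0, 1, 0, 1, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 1, 0],
        [1, 0, 1, 0, 0]]
degraded = add_missing(full, 0.25, seed=1)
print(sum(row.count(None) for row in degraded))
```

Repeating the call with different seeds and proportions, and recomputing the distance matrix and disparity measures each time, gives the kind of curve of disparity against added missing data that this file reports.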

Supporting Information 9

Results of simulations with groups of taxa and different distributions of missing entries for the nine matrices with more than 50 taxa and 50 characters, and less than 5% of missing data.

Supporting Information 10

Description and results of the analysis performed to compare the results of this paper to those of Ciampaglio et al. (2001).

Matrices for Lehmann et al. (2019)

Matrices used in the analyses, saved in NEXUS format and organized into separate folders, one for each analysis.

Scripts for Lehmann et al. (2019)

Scripts used for the analyses and creation of figures and supplementary information.

Data from: Biases with the Generalized Euclidean Distance in disparity analyses with high levels of missing data
