This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Step 1. Input the data, look at the variables, remove constant or otherwise doubtfully useful variables
Remember that variables that may be constant in this data set could become useful in a larger data set where they vary more.
When editing the data file in Excel prior to importing the data into R, use the following formula to make a version substituting “NA� in the blank cells: =IF(ISBLANK(cell reference),“NA�, cell reference)
Read in the data to R from a text file:
martinx<-read.table("Cyperus margaritaceus dataset2_SimoneditSep22.txt", header=TRUE)
Look at the data to see if there are constant variables and missing values. Get a general idea by looking at the column names:
names(martinx)
[1] "Specimen" "Taxon" "Lflen" "Lfwid" "Cullen" "Culwid"
[7] "Numbrac" "Lenbrac" "Numspk" "Spklen" "Spkwid" "Gllen"
[13] "Glwid" "Nutlen" "Nutwid" "Numgl" "Bshpers" "Bshtext"
[19] "Bshglos"
Look at the first six lines with function “head�
head(martinx)
NA
Look to see how many missing values there are (NA) with function “summary�:
summary(martinx)
Specimen Taxon Lflen
M.Swa.1592 : 2 margaritaceus :201 Min. : 1.50
M.Za.9883 : 2 obtusiflorus : 72 1st Qu.:10.12
Nd.Za.1450 : 2 sphaerocephalus: 64 Median :16.00
Sph.Sa.18 : 2 nduru : 60 Mean :17.25
ka.Bu.21193: 1 tisserantii : 47 3rd Qu.:22.50
ka.Gh.1975 : 1 karlschumannii : 14 Max. :90.00
(Other) :479 (Other) : 31 NA's :22
Lfwid Cullen Culwid Numbrac
Min. : 0.200 Min. : 3.50 Min. :0.400 Min. :1.000
1st Qu.: 0.600 1st Qu.: 18.12 1st Qu.:0.800 1st Qu.:2.000
Median : 0.900 Median : 27.75 Median :1.000 Median :3.000
Mean : 1.131 Mean : 31.62 Mean :1.092 Mean :3.022
3rd Qu.: 1.500 3rd Qu.: 41.35 3rd Qu.:1.300 3rd Qu.:3.500
Max. :13.000 Max. :115.70 Max. :2.500 Max. :6.000
NA's :6 NA's :10 NA's :3 NA's :2
Lenbrac Numspk Spklen Spkwid
Min. : 0.160 Min. : 1.000 Min. : 4.000 Min. : 2.000
1st Qu.: 2.500 1st Qu.: 3.000 1st Qu.: 8.815 1st Qu.: 4.300
Median : 3.750 Median : 5.000 Median :10.650 Median : 5.200
Mean : 4.244 Mean : 6.964 Mean :11.988 Mean : 5.414
3rd Qu.: 5.500 3rd Qu.: 9.000 3rd Qu.:13.800 3rd Qu.: 6.500
Max. :25.000 Max. :40.000 Max. :40.000 Max. :10.000
NA's :17 NA's :7 NA's :3 NA's :2
Gllen Glwid Nutlen Nutwid
Min. : 2.500 Min. :1.000 Min. :1.000 Min. :0.500
1st Qu.: 4.500 1st Qu.:1.500 1st Qu.:1.800 1st Qu.:1.200
Median : 5.000 Median :2.000 Median :2.100 Median :1.400
Mean : 5.247 Mean :2.096 Mean :2.104 Mean :1.374
3rd Qu.: 6.000 3rd Qu.:2.500 3rd Qu.:2.400 3rd Qu.:1.500
Max. :10.000 Max. :4.500 Max. :3.500 Max. :2.500
NA's :3 NA's :2 NA's :223 NA's :223
Numgl Bshpers Bshtext Bshglos
Min. : 1.3 fibrous : 19 firm : 53 dull :357
1st Qu.: 8.5 flattened:457 hard : 70 shiny: 99
Median :11.0 NA's : 13 papery:335 NA's : 33
Mean :12.4 NA's : 31
3rd Qu.:14.6
Max. :41.5
NA's :2
The “summary� function also summary statistics for each variable.
Two characters have very large proportions of missing values and for these interpolation is inadvisable and so they have to be deleted (nutlet length and width):
martinx.edit<-martinx[,-c(14:15)]
names(martinx.edit)
[1] "Specimen" "Taxon" "Lflen" "Lfwid" "Cullen" "Culwid"
[7] "Numbrac" "Lenbrac" "Numspk" "Spklen" "Spkwid" "Gllen"
[13] "Glwid" "Numgl" "Bshpers" "Bshtext" "Bshglos"
Step 2. Imputation of missing variables
Now we can interpolate values for the missing data using the “missForest� package. This will provide a complete matrix for analysis. This package handles mixed continuous and categorical data sets, and is non-parametric.
Load the “missForest� package:
library(missForest)
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Loading required package: foreach
Loading required package: itertools
Loading required package: iterators
Delete the first two columns of factors (specimen and taxon names), set seed and carry out the interpolation with the function “missForest�.
martinx.edit.2<-martinx.edit[,-c(1:2)]
set.seed(810)
martinx.edit.2.imp<-missForest(martinx.edit.2, verbose = TRUE)
missForest iteration 1 in progress...done!
estimated error(s): 0.3799116 0.08545668
difference(s): 0.005567988 0.00408998
time: 1.81 seconds
missForest iteration 2 in progress...done!
estimated error(s): 0.3834789 0.08692505
difference(s): 0.000424447 0.0006816633
time: 1.76 seconds
missForest iteration 3 in progress...done!
estimated error(s): 0.3766948 0.07962788
difference(s): 7.394146e-05 0.0006816633
time: 1.76 seconds
missForest iteration 4 in progress...done!
estimated error(s): 0.374215 0.09202605
difference(s): 0.0001042416 0.0006816633
time: 1.75 seconds
Look at the structure of the output object “martinx.edit.2.imp�:
str(martinx.edit.2.imp)
List of 2
$ ximp :'data.frame': 489 obs. of 15 variables:
..$ Lflen : num [1:489] 21 12 11.5 19 24 16 21 17.5 36.5 20 ...
..$ Lfwid : num [1:489] 1.5 1 0.7 1 0.75 1 2.5 1.2 0.7 1.5 ...
..$ Cullen : num [1:489] 49 33 27 28 28 40 40.8 27.3 54.5 62.2 ...
..$ Culwid : num [1:489] 0.8 1 0.6 1 0.8 1 1.8 0.9 1.1 1.3 ...
..$ Numbrac: num [1:489] 3 2 2 3 2.5 2 4 3 2 3 ...
..$ Lenbrac: num [1:489] 4.8 3.5 3.44 5.5 12.5 ...
..$ Numspk : num [1:489] 2.5 1.5 2 3.5 2.5 4.5 6 5 2 4 ...
..$ Spklen : num [1:489] 16.2 10.1 15 14.6 9.8 16 14.8 13 19.8 14.5 ...
..$ Spkwid : num [1:489] 5.8 6.8 7 5.7 5.6 6.25 6.6 5.2 5.3 6.6 ...
..$ Gllen : num [1:489] 5.7 4.5 5.5 5.5 6 6.2 5.4 5 5 5.5 ...
..$ Glwid : num [1:489] 3 3 2.5 2.5 4 3.2 3 3 3 4 ...
..$ Numgl : num [1:489] 12.7 16.7 17 14.8 10 ...
..$ Bshpers: Factor w/ 2 levels "fibrous","flattened": 2 2 2 2 2 2 2 2 2 2 ...
..$ Bshtext: Factor w/ 3 levels "firm","hard",..: 3 3 3 3 3 3 3 3 3 3 ...
..$ Bshglos: Factor w/ 2 levels "dull","shiny": 1 1 1 1 1 1 1 1 1 1 ...
$ OOBerror: Named num [1:2] 0.3767 0.0796
..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
- attr(*, "class")= chr "missForest"
The completed data frame with imputed values is the object “martinx.edit.2.imp$ximp�:
head(martinx.edit.2.imp$ximp)
NA
Now we add back the factor columns we removed (columns 1 and 2) and give the imputed matrix a simpler name:
martinx.imp<-data.frame(martinx.edit[, c(1:2)], martinx.edit.2.imp$ximp)
head(martinx.imp)
NA
Write the new imputed matrix back into Excel so it can be saved separately. This script writes the file into the working directory:
write.csv(martinx.imp, file="Cyperus margaritaceus imputed.csv", row.names = FALSE)
Now have a careful look at the Excel file to see if there are any peculiarities that need to be addressed. Compare with the original Excel file.
Step 3. Principal Coordinate Analysis (PCoA
Because the data frame has both quantitative variable and qualitative characters, we will use Gower’s distance coefficient to compute a distance matrix.
The cells of the distance matrix show the Gower’s morphological distance between every pair of individuals.
The PCoA is a computational approach which attempts to represent these distances as faithfully as possible using orthogonal axes (analogous to PCA axes).
To compute the Gower distance matrix we use the function daisy from the package cluster.
library(cluster)
The computation of Gower’s distance standardizes the variables, so there is no need to standardize them first.
martin.gower<-daisy(martinx.imp[-c(1:2)], metric="gower")
Carry out the PCoA using the function cmdscale in the “stats� package that is loaded automatically with R:
pcoa<-cmdscale(martin.gower, k=10, eig=TRUE)
summary(pcoa)
Length Class Mode
points 4890 -none- numeric
eig 489 -none- numeric
x 0 -none- NULL
ac 1 -none- numeric
GOF 2 -none- numeric
The object “pcoa$points� has the coordinates of the scores on each of the principal coordinates (PCos). The call was for 10 PCos, which means that the scores for the first 10 PCos are printed.
The object eig gives the values of all 414 eigenvalues in descending order. There are as many eigenvalues as objects in the distance matrix. Printing eig shows how many are positive and how many negative. Only positive eigenvalues (and their corresponding PCos) are meaningful.
pcoa$eig
[1] 4.244259e+00 3.173968e+00 1.065240e+00 7.932528e-01
[5] 5.628640e-01 4.392599e-01 3.959920e-01 3.440904e-01
[9] 3.037265e-01 2.503792e-01 2.319920e-01 2.168423e-01
[13] 1.927479e-01 1.646096e-01 1.517877e-01 1.470159e-01
[17] 1.390535e-01 1.273764e-01 1.006129e-01 9.652160e-02
[21] 9.434178e-02 8.596209e-02 8.103600e-02 7.937024e-02
[25] 7.686824e-02 7.128964e-02 6.310731e-02 6.209858e-02
[29] 5.759248e-02 5.491411e-02 5.299950e-02 4.687674e-02
[33] 4.466907e-02 4.287627e-02 3.995105e-02 3.758260e-02
[37] 3.694669e-02 3.393739e-02 3.251374e-02 3.176206e-02
[41] 2.909629e-02 2.865197e-02 2.573298e-02 2.516563e-02
[45] 2.454557e-02 2.409919e-02 2.295154e-02 2.199292e-02
[49] 2.023421e-02 1.952380e-02 1.867106e-02 1.797300e-02
[53] 1.663856e-02 1.569946e-02 1.521185e-02 1.462230e-02
[57] 1.306318e-02 1.266385e-02 1.242249e-02 1.175119e-02
[61] 1.131447e-02 1.053885e-02 1.008696e-02 9.471585e-03
[65] 9.118144e-03 8.415913e-03 8.088253e-03 7.899138e-03
[69] 7.527452e-03 7.496477e-03 7.137608e-03 6.960198e-03
[73] 6.377337e-03 5.884372e-03 5.533017e-03 5.454866e-03
[77] 5.011755e-03 4.903804e-03 4.859934e-03 4.535768e-03
[81] 4.416824e-03 4.231235e-03 4.137643e-03 3.841396e-03
[85] 3.427142e-03 3.378176e-03 3.017528e-03 2.918212e-03
[89] 2.722487e-03 2.609034e-03 2.518789e-03 2.271621e-03
[93] 2.148236e-03 2.051238e-03 1.887642e-03 1.735942e-03
[97] 1.608233e-03 1.539720e-03 1.443297e-03 1.366625e-03
[101] 1.302038e-03 1.210318e-03 1.112911e-03 1.043919e-03
[105] 9.965999e-04 8.862354e-04 8.216288e-04 6.964927e-04
[109] 6.206712e-04 5.232006e-04 4.762193e-04 3.952441e-04
[113] 3.785104e-04 2.986538e-04 2.671530e-04 2.113794e-04
[117] 1.534363e-04 1.093472e-04 7.669815e-05 2.158849e-05
[121] 2.775558e-17 -2.550674e-05 -6.523860e-05 -9.116807e-05
[125] -1.431669e-04 -1.873903e-04 -2.031885e-04 -2.300432e-04
[129] -2.443387e-04 -2.972826e-04 -3.056029e-04 -3.365362e-04
[133] -3.777453e-04 -4.112787e-04 -4.447016e-04 -4.649513e-04
[137] -4.745252e-04 -4.881963e-04 -5.226172e-04 -5.592212e-04
[141] -5.766836e-04 -6.163620e-04 -6.203184e-04 -6.478443e-04
[145] -6.555357e-04 -6.855215e-04 -7.044100e-04 -7.081698e-04
[149] -7.303650e-04 -7.390374e-04 -7.476732e-04 -7.575619e-04
[153] -7.840543e-04 -8.060396e-04 -8.174363e-04 -8.382559e-04
[157] -8.497114e-04 -8.618062e-04 -8.729267e-04 -9.013596e-04
[161] -9.062107e-04 -9.159225e-04 -9.411694e-04 -9.610851e-04
[165] -9.755973e-04 -1.010917e-03 -1.017257e-03 -1.046977e-03
[169] -1.051747e-03 -1.061857e-03 -1.091973e-03 -1.097687e-03
[173] -1.107245e-03 -1.131032e-03 -1.144608e-03 -1.165058e-03
[177] -1.187734e-03 -1.200830e-03 -1.210359e-03 -1.224088e-03
[181] -1.229099e-03 -1.264417e-03 -1.275568e-03 -1.285518e-03
[185] -1.315358e-03 -1.335821e-03 -1.344381e-03 -1.359120e-03
[189] -1.366938e-03 -1.379697e-03 -1.387230e-03 -1.418340e-03
[193] -1.426083e-03 -1.456446e-03 -1.471890e-03 -1.477788e-03
[197] -1.504407e-03 -1.516667e-03 -1.521371e-03 -1.536069e-03
[201] -1.564475e-03 -1.580368e-03 -1.585794e-03 -1.621367e-03
[205] -1.624634e-03 -1.638950e-03 -1.650526e-03 -1.680749e-03
[209] -1.701321e-03 -1.713365e-03 -1.751104e-03 -1.768158e-03
[213] -1.790585e-03 -1.831733e-03 -1.838004e-03 -1.844192e-03
[217] -1.873476e-03 -1.895872e-03 -1.912144e-03 -1.945025e-03
[221] -1.950108e-03 -1.967734e-03 -1.984179e-03 -1.995005e-03
[225] -2.010586e-03 -2.029487e-03 -2.031994e-03 -2.053296e-03
[229] -2.074300e-03 -2.087567e-03 -2.125527e-03 -2.135541e-03
[233] -2.189118e-03 -2.190727e-03 -2.198023e-03 -2.214908e-03
[237] -2.230486e-03 -2.268535e-03 -2.291847e-03 -2.312243e-03
[241] -2.329168e-03 -2.348215e-03 -2.361153e-03 -2.402889e-03
[245] -2.427377e-03 -2.440798e-03 -2.445314e-03 -2.484042e-03
[249] -2.493179e-03 -2.521582e-03 -2.560208e-03 -2.582064e-03
[253] -2.624078e-03 -2.650615e-03 -2.676032e-03 -2.699435e-03
[257] -2.727183e-03 -2.748361e-03 -2.774237e-03 -2.801458e-03
[261] -2.838893e-03 -2.851463e-03 -2.866315e-03 -2.894944e-03
[265] -2.902159e-03 -2.918049e-03 -2.981492e-03 -3.003366e-03
[269] -3.037144e-03 -3.060796e-03 -3.084182e-03 -3.095895e-03
[273] -3.128297e-03 -3.173526e-03 -3.194930e-03 -3.205263e-03
[277] -3.241370e-03 -3.273891e-03 -3.305683e-03 -3.336924e-03
[281] -3.343132e-03 -3.375187e-03 -3.383750e-03 -3.429806e-03
[285] -3.473284e-03 -3.503079e-03 -3.531455e-03 -3.553633e-03
[289] -3.598149e-03 -3.617062e-03 -3.657897e-03 -3.673377e-03
[293] -3.745989e-03 -3.759814e-03 -3.800429e-03 -3.821012e-03
[297] -3.866854e-03 -3.900946e-03 -3.929614e-03 -3.955516e-03
[301] -3.994187e-03 -4.051024e-03 -4.107279e-03 -4.118227e-03
[305] -4.171557e-03 -4.209810e-03 -4.223324e-03 -4.283607e-03
[309] -4.340416e-03 -4.366020e-03 -4.378551e-03 -4.440503e-03
[313] -4.482116e-03 -4.547933e-03 -4.564028e-03 -4.612708e-03
[317] -4.649907e-03 -4.698543e-03 -4.710476e-03 -4.788004e-03
[321] -4.800426e-03 -4.843616e-03 -4.877196e-03 -4.931099e-03
[325] -4.967732e-03 -5.025881e-03 -5.072611e-03 -5.151579e-03
[329] -5.189556e-03 -5.231932e-03 -5.276806e-03 -5.332301e-03
[333] -5.369760e-03 -5.431945e-03 -5.480724e-03 -5.517574e-03
[337] -5.620018e-03 -5.677121e-03 -5.760451e-03 -5.792143e-03
[341] -5.842003e-03 -5.862578e-03 -5.970124e-03 -5.987550e-03
[345] -6.073581e-03 -6.167259e-03 -6.185092e-03 -6.244635e-03
[349] -6.265631e-03 -6.407388e-03 -6.415749e-03 -6.465985e-03
[353] -6.495097e-03 -6.590136e-03 -6.670360e-03 -6.710346e-03
[357] -6.789286e-03 -6.887198e-03 -6.911581e-03 -6.942788e-03
[361] -7.049548e-03 -7.089427e-03 -7.183878e-03 -7.226286e-03
[365] -7.315986e-03 -7.488770e-03 -7.526437e-03 -7.679019e-03
[369] -7.710868e-03 -7.757474e-03 -7.932557e-03 -7.957283e-03
[373] -8.092033e-03 -8.168503e-03 -8.230399e-03 -8.323442e-03
[377] -8.438209e-03 -8.472540e-03 -8.546880e-03 -8.669975e-03
[381] -8.873451e-03 -8.910139e-03 -8.917740e-03 -9.112978e-03
[385] -9.262132e-03 -9.329742e-03 -9.416222e-03 -9.541235e-03
[389] -9.636501e-03 -9.789663e-03 -9.994689e-03 -1.001568e-02
[393] -1.007838e-02 -1.033799e-02 -1.035657e-02 -1.051687e-02
[397] -1.072252e-02 -1.086099e-02 -1.101685e-02 -1.106863e-02
[401] -1.120132e-02 -1.143265e-02 -1.162336e-02 -1.168029e-02
[405] -1.182179e-02 -1.201426e-02 -1.209987e-02 -1.230504e-02
[409] -1.238868e-02 -1.252550e-02 -1.275634e-02 -1.296922e-02
[413] -1.323681e-02 -1.335526e-02 -1.342488e-02 -1.368185e-02
[417] -1.398655e-02 -1.420258e-02 -1.452596e-02 -1.475708e-02
[421] -1.484249e-02 -1.513289e-02 -1.536927e-02 -1.574063e-02
[425] -1.601145e-02 -1.616073e-02 -1.640332e-02 -1.660931e-02
[429] -1.688632e-02 -1.718898e-02 -1.779875e-02 -1.796789e-02
[433] -1.821736e-02 -1.845657e-02 -1.894686e-02 -1.901240e-02
[437] -1.949613e-02 -1.997751e-02 -2.008348e-02 -2.058428e-02
[441] -2.081246e-02 -2.119970e-02 -2.195006e-02 -2.210595e-02
[445] -2.271940e-02 -2.347715e-02 -2.377401e-02 -2.443264e-02
[449] -2.485684e-02 -2.590615e-02 -2.627284e-02 -2.649268e-02
[453] -2.767302e-02 -2.845276e-02 -2.899381e-02 -2.970586e-02
[457] -3.033085e-02 -3.050061e-02 -3.223162e-02 -3.260044e-02
[461] -3.337235e-02 -3.447256e-02 -3.538942e-02 -3.902985e-02
[465] -4.038473e-02 -4.101068e-02 -4.184464e-02 -4.251955e-02
[469] -4.683444e-02 -4.897365e-02 -4.990785e-02 -5.165909e-02
[473] -5.692140e-02 -6.100464e-02 -6.192402e-02 -7.008225e-02
[477] -7.232298e-02 -7.355198e-02 -7.848086e-02 -8.661082e-02
[481] -9.579114e-02 -9.773078e-02 -1.105718e-01 -1.230960e-01
[485] -1.445094e-01 -1.605858e-01 -2.264697e-01 -2.526160e-01
[489] -5.938851e-01
Only the first 121 eigenvalues are positive. We need to visualize them:
barplot(pcoa$eig[1:121])
It would be more useful to show these as percentages:
eig.percent<-(pcoa$eig[1:121]*100)/sum(pcoa$eig[1:121])
round(eig.percent, 3)
[1] 28.541 21.344 7.163 5.334 3.785 2.954 2.663 2.314 2.042
[10] 1.684 1.560 1.458 1.296 1.107 1.021 0.989 0.935 0.857
[19] 0.677 0.649 0.634 0.578 0.545 0.534 0.517 0.479 0.424
[28] 0.418 0.387 0.369 0.356 0.315 0.300 0.288 0.269 0.253
[37] 0.248 0.228 0.219 0.214 0.196 0.193 0.173 0.169 0.165
[46] 0.162 0.154 0.148 0.136 0.131 0.126 0.121 0.112 0.106
[55] 0.102 0.098 0.088 0.085 0.084 0.079 0.076 0.071 0.068
[64] 0.064 0.061 0.057 0.054 0.053 0.051 0.050 0.048 0.047
[73] 0.043 0.040 0.037 0.037 0.034 0.033 0.033 0.031 0.030
[82] 0.028 0.028 0.026 0.023 0.023 0.020 0.020 0.018 0.018
[91] 0.017 0.015 0.014 0.014 0.013 0.012 0.011 0.010 0.010
[100] 0.009 0.009 0.008 0.007 0.007 0.007 0.006 0.006 0.005
[109] 0.004 0.004 0.003 0.003 0.003 0.002 0.002 0.001 0.001
[118] 0.001 0.001 0.000 0.000
Still more useful is a cumulative sum of the eigenvalues so we can see where a good cut-off point is:
eig.percent.cumul<-cumsum(eig.percent)
eig.percent.cumul
[1] 28.54149 49.88557 57.04902 62.38343 66.16854 69.12244
[7] 71.78538 74.09929 76.14177 77.82551 79.38559 80.84379
[13] 82.13997 83.24693 84.26766 85.25630 86.19140 87.04797
[19] 87.72456 88.37364 89.00807 89.58614 90.13108 90.66483
[25] 91.18175 91.66115 92.08553 92.50312 92.89042 93.25970
[31] 93.61611 93.93134 94.23173 94.52006 94.78872 95.04145
[37] 95.28991 95.51813 95.73678 95.95037 96.14603 96.33871
[43] 96.51176 96.68099 96.84605 97.00811 97.16245 97.31035
[49] 97.44642 97.57771 97.70327 97.82413 97.93602 98.04160
[55] 98.14389 98.24222 98.33007 98.41523 98.49877 98.57779
[61] 98.65388 98.72475 98.79258 98.85628 98.91759 98.97419
[67] 99.02858 99.08170 99.13232 99.18273 99.23073 99.27753
[73] 99.32042 99.35999 99.39720 99.43388 99.46758 99.50056
[79] 99.53324 99.56374 99.59345 99.62190 99.64973 99.67556
[85] 99.69860 99.72132 99.74161 99.76124 99.77955 99.79709
[91] 99.81403 99.82931 99.84375 99.85755 99.87024 99.88191
[97] 99.89273 99.90308 99.91279 99.92198 99.93073 99.93887
[103] 99.94636 99.95338 99.96008 99.96604 99.97156 99.97625
[109] 99.98042 99.98394 99.98714 99.98980 99.99235 99.99435
[115] 99.99615 99.99757 99.99860 99.99934 99.99985 100.00000
[121] 100.00000
We need 36 eigenvalues to capture 95% of the variance:
barplot(pcoa$eig[1:36])
What this means is that the ordinations of the first two principal coordinates will only express 49.9% of the total variation.
We might need to try Non-Metric Multidimensional Scaling (NMDS)to improve on this - it finds the best 2 dimensional plot by an iteration process.
Plot the PCoA ordinations:
scores<-pcoa$points
plot(scores[,1],scores[,2], pch=16, cex=0.8, col=c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey")[martinx.imp[,2]], asp=1, xlab="PCo1", ylab="PCo2", main="PCoA Cyperus margaritaceus complex", cex.main=0.8)
legend("bottomleft", levels(martinx.imp[,2]), pch=16, col=c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.8, pt.cex = 1.0)
NA
NA
Carry out a Principal Component Analysis (PCA) using the quantitative variables on the imputed data set martinx.imp
Run the PCA
martinx.imp.pca<-prcomp(martinx.imp[,3:14], scale=TRUE)
summary(martinx.imp.pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.1980 1.4456 1.03308 0.92074 0.89322 0.76397
Proportion of Variance 0.4026 0.1741 0.08894 0.07065 0.06649 0.04864
Cumulative Proportion 0.4026 0.5767 0.66567 0.73631 0.80280 0.85144
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 0.7039 0.66439 0.51426 0.50551 0.46846 0.32604
Proportion of Variance 0.0413 0.03679 0.02204 0.02129 0.01829 0.00886
Cumulative Proportion 0.8927 0.92952 0.95156 0.97285 0.99114 1.00000
Compute the number of significant principal components using the function “evplot.R� which is obtainable via the Springer book web page for Borcard et al’s “Numerical Ecology� (Borcard et al., 2011)
First run the function:
evplot() Plot eigenvalues and percentages of variation Kaiser rule and broken stick model License: GPL-2 Author: Francois Gillet, Octber 2007
evplot = function(ev) {
# Broken stick model (MacArthur 1957)
n = length(ev)
bsm = data.frame(j=seq(1:n), p=0)
bsm$p[1] = 1/n
for (i in 2:n) {
bsm$p[i] = bsm$p[i-1] + (1/(n + 1 - i))
}
bsm$p = 100*bsm$p/n
# Plot eigenvalues and % of variation for each axis
op = par(mfrow=c(2,1))
barplot(ev, main="Eigenvalues", col="bisque", las=2)
abline(h=mean(ev), col="red")
legend("topright", "Average eigenvalue", lwd=1, col=2, bty="n")
barplot(t(cbind(100*ev/sum(ev), bsm$p[n:1])), beside=T, main="% variation", col=c("bisque",2), las=2)
legend("topright", c("% eigenvalue", "Broken stick model"),pch=15, col=c("bisque",2), bty="n")
par(op)
}
Use the function:
ev<-(martinx.imp.pca$sdev)^2
evplot(ev)
This shows significance for the first 3 PCs, which express 66.6% of total variance.
Plot the first three PCs:
par(mfrow=c(1,1))
plot(martinx.imp.pca$x[,1], martinx.imp.pca$x[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("bottomleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
par(mfrow=c(1,1))
plot(martinx.imp.pca$x[,1], martinx.imp.pca$x[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("bottomleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
NA
NA
par(mfrow=c(1,1))
plot(martinx.imp.pca$x[,2], martinx.imp.pca$x[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("bottomleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
NA
NA
Replot with convex hulls
To plot convex hulls around the species groups we can use the function Plot ConvexHull from Ken Takagi’s website Chit Chat R to overplot convex hulls onto the original plot.
### Plotting function to plot convex hulls
### Filename: Plot_ConvexHull.R
### Notes:
############################################################################
# INPUTS:
# xcoords: x-coordinates of point data
# ycoords: y-coordinates of point data
# lcolor: line color
# OUTPUTS:
# convex hull around data points in a particular color (specified by lcolor)
# FUNCTION:
Plot_ConvexHull<-function(xcoord, ycoord, lcolor){
hpts <- chull(x = xcoord, y = ycoord)
hpts <- c(hpts, hpts[1])
lines(xcoord[hpts], ycoord[hpts], col = lcolor)
}
# END OF FUNCTION
par(mfrow=c(1,1))
plot(martinx.imp.pca$x[,1], martinx.imp.pca$x[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="karlschumannii"],martinx.imp.pca$x[,2][martinx.imp[,2]=="karlschumannii"], lcolor = "red")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="ledermannii"],martinx.imp.pca$x[,2][martinx.imp[,2]=="ledermannii"], lcolor = "blue")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="margaritaceus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="margaritaceus"], lcolor = "green")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="nduru"],martinx.imp.pca$x[,2][martinx.imp[,2]=="nduru"], lcolor = "black")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="niveus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="niveus"], lcolor = "magenta")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="obtusiflorus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="obtusiflorus"], lcolor = "cyan3")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="somaliensis"],martinx.imp.pca$x[,2][martinx.imp[,2]=="obtusiflorus"], lcolor = "sienna")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="sphaerocephalus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="obtusiflorus"], lcolor = "khaki")
Plot_ConvexHull(martinx.imp.pca$x[,1][martinx.imp[,2]=="tisserantii"],martinx.imp.pca$x[,2][martinx.imp[,2]=="tisserantii"], lcolor = "grey")
NA
NA
Now replot with 95% confidence ellipses.
Now plot with 95% confidence ellipses using using the ELLI function from Julien Claude’s book (Morphometrics with R) website:
Run ELLI function:
#run the ELLI function:
ELLI<-function(x,y,conf=0.95,np=50)
{centroid<-apply(cbind(x,y),2,mean)
ang <- seq(0,2*pi,length=np)
z<-cbind(cos(ang),sin(ang))
radiuscoef<-qnorm((1-conf)/2, lower.tail=F)
vcvxy<-var(cbind(x,y))
r<-cor(x,y)
M1<-matrix(c(1,1,-1,1),2,2)
M2<-matrix(c(var(x), var(y)),2,2)
M3<-matrix(c(1+r, 1-r),2,2, byrow=T)
ellpar<-M1*sqrt(M2*M3/2)
t(centroid + radiuscoef * ellpar %*% t(z))}
par(mfrow=c(1,1))
plot(martinx.imp.pca$x[,1], martinx.imp.pca$x[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
#plot the ellipses:
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="karlschumannii"],martinx.imp.pca$x[,2][martinx.imp[,2]=="karlschumannii"]), col = "red")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="ledermannii"],martinx.imp.pca$x[,2][martinx.imp[,2]=="ledermannii"]), col = "blue")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="margaritaceus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="margaritaceus"]), col = "green")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="nduru"],martinx.imp.pca$x[,2][martinx.imp[,2]=="nduru"]), col = "black")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="niveus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="niveus"]), col = "magenta")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="obtusiflorus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="obtusiflorus"]), col = "cyan3")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="somaliensis"],martinx.imp.pca$x[,2][martinx.imp[,2]=="somaliensis"]), col = "sienna")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="sphaerocephalus"],martinx.imp.pca$x[,2][martinx.imp[,2]=="sphaerocephalus"]), col = "khaki")
lines(ELLI(martinx.imp.pca$x[,1][martinx.imp[,2]=="tisserantii"],martinx.imp.pca$x[,2][martinx.imp[,2]=="tisserantii"]), col = "grey")
NA
NA
Plot the loadings for the first two principal components
par(mfrow=c(1,2))
barplot(sort(martinx.imp.pca$rotation[,1]), las=2, cex.names=0.7, main="PC1", ylab="Loadings")
barplot(sort(martinx.imp.pca$rotation[,2]), las=2, cex.names=0.7, main="PC2", ylab="Loadings")
par(mfrow=c(1,1))
Plot the biplot
biplot(martinx.imp.pca, cex=0.5)
Linear discriminant analysis (LDA)
Following Legendre & Legendre (2012, Numerical Ecology, 3rd English Edition), we should analyse untransformed data for testing the allocation of individuals to populations, and standardized data for computing the contribution of each variable to the discriminant function axes. But first we have to test the data for homoscedasticity (data is homoscedastic if the species/groups/categorical variables have equal/homogeneous covariance matrices) with “betadisper� function in “vegan� package.
Make sure you have installed the vegan package (command is install.packages(“vegan�))
Test for multivariate homogeneity of variances using function “betadisper� in package “vegan�. Start by checking the for homoscedasticity (multivariate homogeneity of within-group covariance matrices) of the multivariate data ([Borcard et al., 2011], page 208). We use the “betadisper� function of the “vegan� package, implementing Marti Anderson’s testing method. First compute the Euclidean distance matrix between pairs of objects:
library(vegan)
Loading required package: permute
Loading required package: lattice
This is vegan 2.5-7
means.dist<- dist(martinx.imp[, 3:14])
means.MHV<-betadisper(means.dist, martinx.imp[,2])
means.MHV
Homogeneity of multivariate dispersions
Call: betadisper(d = means.dist, group = martinx.imp[, 2])
No. of Positive Eigenvalues: 12
No. of Negative Eigenvalues: 0
Average distance to median:
karlschumannii ledermannii margaritaceus nduru
25.749 12.144 18.280 10.931
niveus obtusiflorus somaliensis sphaerocephalus
17.545 17.513 3.723 12.399
tisserantii
10.932
Eigenvalues for PCoA axes:
(Showing 8 of 12 eigenvalues)
PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8
213818.5 22662.7 18618.4 11950.6 2434.7 2094.1 928.3 370.1
anova(means.MHV)
Analysis of Variance Table
Response: Distances
Df Sum Sq Mean Sq F value Pr(>F)
Groups 8 7184 897.96 8.4702 8.417e-11 ***
Residuals 480 50886 106.01
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
permutest(means.MHV)
Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 999
Response: Distances
Df Sum Sq Mean Sq F N.Perm Pr(>F)
Groups 8 7184 897.96 8.4702 999 0.001 ***
Residuals 480 50886 106.01
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the above, the Analysis of Variance is a parametric test. The permutation test is non-parametric.
These results show that the species groups do not have homogeneous covariance matrices, i.e. they are heteroscadistic. This is not surprising since they are different species. For the purpose of plotting the ordinations of the species, this is not really a problem. It is a problem if you want to use multivariate analysis of variance (MANOVA) to compute the significance of the differences between the species group centroids (means). However, we can get round this using non-parametric tests.
Now we compute the LDA, after loading the MASS package:
library(MASS)
lda<-lda(martinx.imp[, 3:14], martinx.imp[,2])
lda
Call:
lda(martinx.imp[, 3:14], martinx.imp[, 2])
Prior probabilities of groups:
karlschumannii ledermannii margaritaceus nduru
0.02862986 0.02658487 0.41104294 0.12269939
niveus obtusiflorus somaliensis sphaerocephalus
0.02453988 0.14723926 0.01226994 0.13087935
tisserantii
0.09611452
Group means:
Lflen Lfwid Cullen Culwid Numbrac
karlschumannii 37.128643 1.7035714 72.671367 1.7428571 2.607143
ledermannii 14.743312 2.7461538 38.309267 1.6000000 3.807692
margaritaceus 22.146063 1.2690124 38.258418 1.1181169 2.677363
nduru 5.997958 0.5658333 19.923333 0.8886667 2.675000
niveus 17.177927 1.1186542 34.341837 1.1086694 3.375000
obtusiflorus 17.561627 1.3098611 31.125694 1.1773611 3.451389
somaliensis 4.583333 0.4000000 6.133333 0.5166667 2.416667
sphaerocephalus 15.218403 1.0339813 26.860937 1.1351563 3.371094
tisserantii 11.990000 0.6308511 17.636596 0.7997872 3.702128
Lenbrac Numspk Spklen Spkwid Gllen Glwid
karlschumannii 5.975998 3.535714 26.382143 6.778571 5.821429 4.092857
ledermannii 4.757692 12.846154 11.584615 5.792308 6.969231 1.484615
margaritaceus 5.412128 4.146353 14.359368 5.833215 5.293510 2.591835
nduru 1.715500 3.835000 8.312500 4.310500 4.411000 1.853333
niveus 4.954167 7.625000 11.709167 4.907500 5.011667 1.516667
obtusiflorus 4.615773 15.974861 9.681111 4.897778 5.340694 1.376389
somaliensis 2.041667 8.250000 5.166667 2.683333 3.283333 1.083333
sphaerocephalus 3.384063 9.028125 10.898438 6.631250 6.092187 1.820312
tisserantii 3.018085 5.459805 8.198389 4.099792 4.461523 1.605128
Numgl
karlschumannii 26.469286
ledermannii 12.038462
margaritaceus 14.741902
nduru 8.504577
niveus 13.895833
obtusiflorus 10.577361
somaliensis 7.166667
sphaerocephalus 10.789062
tisserantii 8.511277
Coefficients of linear discriminants:
LD1 LD2 LD3 LD4 LD5
Lflen -0.023525716 0.04721197 0.03339859 0.087453512 0.020389863
Lfwid -0.050499480 0.11003994 -0.03215027 -0.411868760 -0.510151118
Cullen -0.001049291 -0.01240766 0.02637771 -0.034887998 -0.006978838
Culwid -0.193396314 0.67623511 -1.25282683 -1.650969331 0.704954296
Numbrac 0.317217302 -0.20156809 0.12812841 -0.002648659 0.997125842
Lenbrac -0.057260622 -0.05334681 0.02657255 -0.003301194 -0.356712335
Numspk 0.138681816 0.14654102 0.12413233 0.096384224 0.034244679
Spklen -0.045497741 0.03549820 0.08874904 0.064787702 0.160510267
Spkwid 0.077392289 -0.07153027 -0.56378902 0.623529349 -0.073350292
Gllen 0.295913140 0.58331145 -0.48217096 -0.490178211 -0.133578753
Glwid -1.152399505 0.10950581 0.44181671 0.275730541 0.623281770
Numgl -0.032087894 0.07627901 -0.05164136 -0.093020425 -0.057264312
LD6 LD7 LD8
Lflen 0.04948426 0.008739104 0.02828168
Lfwid 0.30095690 0.470288628 0.43134241
Cullen -0.01189099 -0.011423317 -0.05421764
Culwid -0.61746631 -0.869524369 -0.82136196
Numbrac 0.84727743 0.138985759 -0.22485825
Lenbrac 0.11331662 -0.043187061 -0.15412531
Numspk -0.13405577 0.056804396 0.02626085
Spklen -0.02325600 0.047768333 0.04356984
Spkwid -0.16202133 -0.196410804 -0.14100907
Gllen 0.20259878 0.194052825 0.23645061
Glwid -0.76148447 1.140571960 -0.07692226
Numgl 0.07367221 -0.141219816 0.13327458
Proportion of trace:
LD1 LD2 LD3 LD4 LD5 LD6 LD7 LD8
0.5553 0.2309 0.0852 0.0469 0.0425 0.0331 0.0044 0.0017
The proportion of trace table shows that 78.62% (0.7862) of the variance is expressed in the first two axes
Now we compute the scores for plotting the ordination. What happens here is that we multiply the matrix of variables in the martinx.imp data frame by the eigenvector matrix, which is a sub-object of the lda object (lda$scaling) and described in the above output as “Coefficients of linear discriminants�:
scores<-as.matrix(martinx.imp[,3:14])%*%lda$scaling
head(scores)
LD1 LD2 LD3 LD4 LD5 LD6
[1,] -2.217331 5.374388 -2.143290 -0.07282965 3.8080457 1.6511650
[2,] -2.512219 4.669681 -4.120335 -0.07095028 2.4433946 0.2929680
[3,] -1.675271 5.207487 -4.122514 0.70947675 2.6909460 0.8847473
[4,] -1.548064 5.668314 -3.173801 0.05446960 3.4403451 2.0208105
[5,] -3.530078 5.252783 -2.453074 1.04374979 0.8760719 1.5570799
[6,] -2.391148 6.273904 -3.220752 -0.37658305 2.4234024 0.4511635
LD7 LD8
[1,] 2.3599009 -0.7293124
[2,] 0.6297856 -0.2547761
[3,] 0.7021037 0.7830545
[4,] 1.2388024 0.1052282
[5,] 3.1792555 -1.5148822
[6,] 1.6469284 -0.2528704
Now plot the first two discriminant axes and add legend:
plot(scores[,1], scores[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
Repeat for axes 2 and 3:
plot(scores[,1], scores[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("bottomleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
plot(scores[,2], scores[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("bottomleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
To plot convex hulls around the species groups we can use the function “Plot ConvexHull� from Ken Takagi’s website “Chit Chat R� to overplot convex hulls onto the original plot.
Load the convex hulls function:
### Plotting function to plot convex hulls
### Filename: Plot_ConvexHull.R
### Notes:
############################################################################
# INPUTS:
# xcoords: x-coordinates of point data
# ycoords: y-coordinates of point data
# lcolor: line color
# OUTPUTS:
# convex hull around data points in a particular color (specified by lcolor)
# FUNCTION:
Plot_ConvexHull<-function(xcoord, ycoord, lcolor){
hpts <- chull(x = xcoord, y = ycoord)
hpts <- c(hpts, hpts[1])
lines(xcoord[hpts], ycoord[hpts], col = lcolor)
}
# END OF FUNCTION
Replot the LDA ordinations and then the convex hulls:
plot(scores[,1], scores[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="karlschumannii"],
scores[,2][martinx.imp[,2]=="karlschumannii"], lcolor = "red")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="ledermannii"],
scores[,2][martinx.imp[,2]=="ledermannii"], lcolor = "blue")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="margaritaceus"],
scores[,2][martinx.imp[,2]=="margaritaceus"], lcolor = "green")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="nduru"],
scores[,2][martinx.imp[,2]=="nduru"], lcolor = "black")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="niveus"],
scores[,2][martinx.imp[,2]=="niveus"], lcolor = "magenta")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="obtusiflorus"],
scores[,2][martinx.imp[,2]=="obtusiflorus"], lcolor = "cyan3")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="somaliensis"],
scores[,2][martinx.imp[,2]=="somaliensis"], lcolor = "sienna")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="sphaerocephalus"],
scores[,2][martinx.imp[,2]=="sphaerocephalus"], lcolor = "khaki")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="tisserantii"],
scores[,2][martinx.imp[,2]=="tisserantii"], lcolor = "grey")
plot(scores[,1], scores[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="karlschumannii"],
scores[,3][martinx.imp[,2]=="karlschumannii"], lcolor = "red")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="ledermannii"],
scores[,3][martinx.imp[,2]=="ledermannii"], lcolor = "blue")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="margaritaceus"],
scores[,3][martinx.imp[,2]=="margaritaceus"], lcolor = "green")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="nduru"],
scores[,3][martinx.imp[,2]=="nduru"], lcolor = "black")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="niveus"],
scores[,3][martinx.imp[,2]=="niveus"], lcolor = "magenta")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="obtusiflorus"],
scores[,3][martinx.imp[,2]=="obtusiflorus"], lcolor = "cyan3")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="somaliensis"],
scores[,3][martinx.imp[,2]=="somaliensis"], lcolor = "sienna")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="sphaerocephalus"],
scores[,3][martinx.imp[,2]=="sphaerocephalus"], lcolor = "khaki")
Plot_ConvexHull(scores[,1][martinx.imp[,2]=="tisserantii"],
scores[,3][martinx.imp[,2]=="tisserantii"], lcolor = "grey")
plot(scores[,2], scores[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="karlschumannii"],
scores[,3][martinx.imp[,2]=="karlschumannii"], lcolor = "red")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="ledermannii"],
scores[,3][martinx.imp[,2]=="ledermannii"], lcolor = "blue")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="margaritaceus"],
scores[,3][martinx.imp[,2]=="margaritaceus"], lcolor = "green")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="nduru"],
scores[,3][martinx.imp[,2]=="nduru"], lcolor = "black")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="niveus"],
scores[,3][martinx.imp[,2]=="niveus"], lcolor = "magenta")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="obtusiflorus"],
scores[,3][martinx.imp[,2]=="obtusiflorus"], lcolor = "cyan3")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="somaliensis"],
scores[,3][martinx.imp[,2]=="somaliensis"], lcolor = "sienna")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="sphaerocephalus"],
scores[,3][martinx.imp[,2]=="sphaerocephalus"], lcolor = "khaki")
Plot_ConvexHull(scores[,2][martinx.imp[,2]=="tisserantii"],
scores[,3][martinx.imp[,2]=="tisserantii"], lcolor = "grey")
Now plot with 95% confidence ellipses using using the ELLI function from Julien Claude’s book (Morphometrics with R) website:
Run ELLI function:
ELLI<-function(x,y,conf=0.95,np=50)
{centroid<-apply(cbind(x,y),2,mean)
ang <- seq(0,2*pi,length=np)
z<-cbind(cos(ang),sin(ang))
radiuscoef<-qnorm((1-conf)/2, lower.tail=F)
vcvxy<-var(cbind(x,y))
r<-cor(x,y)
M1<-matrix(c(1,1,-1,1),2,2)
M2<-matrix(c(var(x), var(y)),2,2)
M3<-matrix(c(1+r, 1-r),2,2, byrow=T)
ellpar<-M1*sqrt(M2*M3/2)
t(centroid + radiuscoef * ellpar %*% t(z))}
Replot now using ELLI to plot the ellipses:
plot(scores[,1], scores[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
lines(ELLI(scores[,1][martinx.imp[,2]=="karlschumannii"],
scores[,2][martinx.imp[,2]=="karlschumannii"]), col= "red")
lines(ELLI(scores[,1][martinx.imp[,2]=="ledermannii"],
scores[,2][martinx.imp[,2]=="ledermannii"]), col= "blue")
lines(ELLI(scores[,1][martinx.imp[,2]=="margaritaceus"],
scores[,2][martinx.imp[,2]=="margaritaceus"]), col = "green")
lines(ELLI(scores[,1][martinx.imp[,2]=="nduru"],
scores[,2][martinx.imp[,2]=="nduru"]), col = "black")
lines(ELLI(scores[,1][martinx.imp[,2]=="niveus"],
scores[,2][martinx.imp[,2]=="niveus"]), col = "magenta")
lines(ELLI(scores[,1][martinx.imp[,2]=="obtusiflorus"],
scores[,2][martinx.imp[,2]=="obtusiflorus"]), col = "cyan3")
lines(ELLI(scores[,1][martinx.imp[,2]=="somaliensis"],
scores[,2][martinx.imp[,2]=="somaliensis"]), col = "sienna")
lines(ELLI(scores[,1][martinx.imp[,2]=="sphaerocephalus"],
scores[,2][martinx.imp[,2]=="sphaerocephalus"]), col = "khaki")
lines(ELLI(scores[,1][martinx.imp[,2]=="tisserantii"],
scores[,2][martinx.imp[,2]=="tisserantii"]), col = "grey")
plot(scores[,1], scores[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
lines(ELLI(scores[,1][martinx.imp[,2]=="karlschumannii"],
scores[,3][martinx.imp[,2]=="karlschumannii"]), col= "red")
lines(ELLI(scores[,1][martinx.imp[,2]=="ledermannii"],
scores[,3][martinx.imp[,2]=="ledermannii"]), col= "blue")
lines(ELLI(scores[,1][martinx.imp[,2]=="margaritaceus"],
scores[,3][martinx.imp[,2]=="margaritaceus"]), col = "green")
lines(ELLI(scores[,1][martinx.imp[,2]=="nduru"],
scores[,3][martinx.imp[,2]=="nduru"]), col = "black")
lines(ELLI(scores[,1][martinx.imp[,2]=="niveus"],
scores[,3][martinx.imp[,2]=="niveus"]), col = "magenta")
lines(ELLI(scores[,1][martinx.imp[,2]=="obtusiflorus"],
scores[,3][martinx.imp[,2]=="obtusiflorus"]), col = "cyan3")
lines(ELLI(scores[,1][martinx.imp[,2]=="somaliensis"],
scores[,3][martinx.imp[,2]=="somaliensis"]), col = "sienna")
lines(ELLI(scores[,1][martinx.imp[,2]=="sphaerocephalus"],
scores[,3][martinx.imp[,2]=="sphaerocephalus"]), col = "khaki")
lines(ELLI(scores[,1][martinx.imp[,2]=="tisserantii"],
scores[,3][martinx.imp[,2]=="tisserantii"]), col = "grey")
plot(scores[,2], scores[,3],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
legend("topleft",inset=0.025, legend=c("karlschumannii","ledermannii","margaritaceus", "nduru","niveus","obtusiflorus","somaliensis", "sphaerocephalus","tisserantii"),pch=c(rep(16,7))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), cex=0.7)
lines(ELLI(scores[,2][martinx.imp[,2]=="karlschumannii"],
scores[,3][martinx.imp[,2]=="karlschumannii"]), col= "red")
lines(ELLI(scores[,2][martinx.imp[,2]=="ledermannii"],
scores[,3][martinx.imp[,2]=="ledermannii"]), col= "blue")
lines(ELLI(scores[,2][martinx.imp[,2]=="margaritaceus"],
scores[,3][martinx.imp[,2]=="margaritaceus"]), col = "green")
lines(ELLI(scores[,2][martinx.imp[,2]=="nduru"],
scores[,3][martinx.imp[,2]=="nduru"]), col = "black")
lines(ELLI(scores[,2][martinx.imp[,2]=="niveus"],
scores[,3][martinx.imp[,2]=="niveus"]), col = "magenta")
lines(ELLI(scores[,2][martinx.imp[,2]=="obtusiflorus"],
scores[,3][martinx.imp[,2]=="obtusiflorus"]), col = "cyan3")
lines(ELLI(scores[,2][martinx.imp[,2]=="somaliensis"],
scores[,3][martinx.imp[,2]=="somaliensis"]), col = "sienna")
lines(ELLI(scores[,2][martinx.imp[,2]=="sphaerocephalus"],
scores[,3][martinx.imp[,2]=="sphaerocephalus"]), col = "khaki")
lines(ELLI(scores[,2][martinx.imp[,2]=="tisserantii"],
scores[,3][martinx.imp[,2]=="tisserantii"]), col = "grey")
Now plot the loadings on the first two discriminant axes to see which characters are most influential in separating the species:
par(mfrow=c(2,1))
barplot(sort(lda$scaling[,1]), cex=0.8, cex.main=1.2, las=2, cex.names=0.8, main="A", ylab="loadings")
barplot(sort(lda$scaling[,2]), cex=0.8, cex.main=1.2, las=2, cex.names=0.8, main="B", ylab="loadings")
These bar charts tell us which characters are most influential in the scores on each axis. Individuals that have more negative scores on a given discriminant axis are tend to have higher values for the characters (bars) towards the left, and individuals with more positive scores have higher values for the characters (bars) towards the right.
LDA using scaled data
Now we repeat these LDA analyses using scaled data to see what difference this makes.
First make a data frame of the scaled data using the function “scale� , which centres the data (in each column, the mean of the column is subtracted from the individual values) and then the variance of each column is set to zero (by dividing the centred values of each column by the column standard deviation).
martinx.imp.sc<-scale(martinx.imp[, 3:14])
head(martinx.imp.sc)
Lflen Lfwid Cullen Culwid Numbrac Lenbrac
[1,] 0.3438436 0.3826877 0.90977461 -0.8126153 -0.02570601 0.1793059
[2,] -0.5407290 -0.1401218 0.05446955 -0.2586966 -1.17893911 -0.2794728
[3,] -0.5898719 -0.4538075 -0.26626985 -1.3665339 -1.17893911 -0.2998003
[4,] 0.1472719 -0.1401218 -0.21281329 -0.2586966 -0.02570601 0.4263406
[5,] 0.6387011 -0.4015265 -0.21281329 -0.8126153 -0.60232256 2.8966878
[6,] -0.1475856 -0.1401218 0.42866551 -0.2586966 -1.17893911 0.7792474
Numspk Spklen Spkwid Gllen Glwid Numgl
[1,] -0.7805886 0.8115038 0.2497099 0.4151647 1.1385881 0.04564896
[2,] -0.9554338 -0.3604367 0.8917976 -0.6793429 1.1385881 0.74092434
[3,] -0.8680112 0.5809581 1.0202151 0.2327468 0.5094702 0.80002274
[4,] -0.6057433 0.5041096 0.1855011 0.2327468 0.5094702 0.40893034
[5,] -0.7805886 -0.4180731 0.1212923 0.6887916 2.3968239 -0.41670916
[6,] -0.4308980 0.7730795 0.5386493 0.8712096 1.3902353 0.65053854
Now run the LDas on the scaled data frame:
lda.sc<-lda(martinx.imp.sc, martinx.imp[,2])
lda.sc
Call:
lda(martinx.imp.sc, grouping = martinx.imp[, 2])
Prior probabilities of groups:
karlschumannii ledermannii margaritaceus nduru
0.02862986 0.02658487 0.41104294 0.12269939
niveus obtusiflorus somaliensis sphaerocephalus
0.02453988 0.14723926 0.01226994 0.13087935
tisserantii
0.09611452
Group means:
Lflen Lfwid Cullen Culwid
karlschumannii 1.929060809 0.59554580 2.17516459 1.79871554
ledermannii -0.271100277 1.68568973 0.33828471 1.40305935
margaritaceus 0.456485384 0.14116269 0.33556651 0.06843919
nduru -1.130644714 -0.59409470 -0.64456415 -0.56704468
niveus -0.031812053 -0.01605477 0.12619957 0.04227354
obtusiflorus 0.005900256 0.18387484 -0.04572439 0.23252152
somaliensis -1.269682266 -0.76749318 -1.38173020 -1.59733339
sphaerocephalus -0.224405501 -0.10459037 -0.27370366 0.11563122
tisserantii -0.541711813 -0.52611093 -0.76680529 -0.81320456
Numbrac Lenbrac Numspk Spklen Spkwid
karlschumannii -0.4787619 0.5943234 -0.5994988 2.76771138 0.8780386
ledermannii 0.9057515 0.1643752 1.0283874 -0.07521033 0.2447707
margaritaceus -0.3977815 0.3953301 -0.4927315 0.45787896 0.2710368
nduru -0.4005068 -0.9092349 -0.5471701 -0.70385373 -0.7066798
niveus 0.4067564 0.2337124 0.1154934 -0.05128137 -0.3233534
obtusiflorus 0.4948506 0.1142910 1.5754270 -0.44091424 -0.3295960
somaliensis -0.6984253 -0.7941285 0.2247717 -1.30823563 -1.7514635
sphaerocephalus 0.4022516 -0.3203880 0.3608231 -0.20703979 0.7834453
tisserantii 0.7840108 -0.4495439 -0.2630807 -0.72577682 -0.8419729
Gllen Glwid Numgl
karlschumannii 0.52591846 2.5136601 2.44596304
ledermannii 1.57281700 -0.7681230 -0.06238614
margaritaceus 0.04440944 0.6250201 0.40752279
nduru -0.76051889 -0.3041889 -0.67664182
niveus -0.21265702 -0.7277950 0.26046008
obtusiflorus 0.08744582 -0.9042975 -0.31635292
somaliensis -1.78905202 -1.2730305 -0.90919589
sphaerocephalus 0.77287488 -0.3457369 -0.27955523
tisserantii -0.71443701 -0.6164902 -0.67547735
Coefficients of linear discriminants:
LD1 LD2 LD3 LD4 LD5
Lflen -0.23936018 0.48035369 0.33981078 0.889787508 0.2074547
Lfwid -0.04829626 0.10523905 -0.03074759 -0.393899482 -0.4878939
Cullen -0.01962885 -0.23210726 0.49344186 -0.652641949 -0.1305516
Culwid -0.06982842 0.24416405 -0.45235047 -0.596105331 0.2545335
Numbrac 0.27506781 -0.17478521 0.11110365 -0.002296725 0.8646351
Lenbrac -0.16225426 -0.15116404 0.07529623 -0.009354295 -1.0107836
Numspk 0.79316887 0.83811836 0.70995539 0.551254435 0.1958571
Spklen -0.23681766 0.18476962 0.46194250 0.337222720 0.8354627
Spkwid 0.12053227 -0.11140264 -0.87805607 0.971096826 -0.1142372
Gllen 0.32443426 0.63953300 -0.52864424 -0.537423260 -0.1464535
Glwid -0.91588518 0.08703123 0.35113984 0.219140596 0.4953617
Numgl -0.18460538 0.43884202 -0.29709875 -0.535157311 -0.3294482
LD6 LD7 LD8
Lflen 0.5034730 -0.08891519 -0.28774926
Lfwid 0.2878266 -0.44977057 -0.41252353
Cullen -0.2224421 0.21369343 1.01423722
Culwid -0.2229448 0.31395381 0.29656410
Numbrac 0.7346975 -0.12051836 0.19498075
Lenbrac 0.3210951 0.12237528 0.43673101
Numspk -0.7667109 -0.32488383 -0.15019479
Spklen -0.1210484 -0.24863619 -0.22678286
Spkwid -0.2523352 0.30589404 0.21961029
Gllen 0.2221259 -0.21275630 -0.25924053
Glwid -0.6052001 -0.90648507 0.06113501
Numgl 0.4238448 0.81245400 -0.76674415
Proportion of trace:
LD1 LD2 LD3 LD4 LD5 LD6 LD7 LD8
0.5553 0.2309 0.0852 0.0469 0.0425 0.0331 0.0044 0.0017
scores<-as.matrix(martinx.imp.sc)%*%lda.sc$scaling
head(scores)
LD1 LD2 LD3 LD4 LD5 LD6
[1,] -1.795831 -0.3742138 0.7004206 0.1320457 0.34028774 0.16048506
[2,] -2.090720 -1.0789210 -1.2766238 0.1339250 -1.02436335 -1.19771192
[3,] -1.253771 -0.5411153 -1.2788028 0.9143521 -0.77681192 -0.60593267
[4,] -1.126564 -0.0802878 -0.3300897 0.2593449 -0.02741286 0.53013059
[5,] -3.108579 -0.4958188 0.3906370 1.2486251 -2.59168606 0.06639996
[6,] -1.969649 0.5253025 -0.3770413 -0.1717077 -1.04435558 -1.03951640
LD7 LD8
[1,] -1.19350294 0.41324563
[2,] 0.53661237 -0.06129069
[3,] 0.46429431 -1.09912129
[4,] -0.07240445 -0.42129500
[5,] -2.01285754 1.19881542
[6,] -0.48053043 -0.06319639
plot(scores[,1], scores[,2],pch=c(rep(16,9))[martinx.imp[,2]], col= c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey") [martinx.imp[,2]], cex=1, asp=1)
par(mfrow=c(2,1))
barplot(sort(lda.sc$scaling[,1]), cex=0.8, cex.main=1.2, las=2, cex.names=0.8, main="A", ylab="loadings")
barplot(sort(lda.sc$scaling[,2]), cex=0.8, cex.main=1.2, las=2, cex.names=0.8, main="B", ylab="loadings")
Scaling the data has no effect on the topology of the ordinations, but it changes the loadings on the discriminant axes (bar chart above) that indicate which variables/characters are influencing the score values the most and the least (the central bars are the least influential). It is best do use the scaled data result to estimate the influence of each character on the discriminant analyses.
LDA using cross-validation
When the data violate the basic assumptions of the parametric LDA approach (data overall with a normal distribution, and group covariance matrices homoscedastic), the best way to estimate the significance of the intergroup differences is to use cross-validation. This is a process of allocating each individual to the group to which it is nearest, as estimated by its (Mahalanobis) distance from each group centroid, as computed from its scores on the discriminant axes. For this it is best to use the untransformed data.
To use cross-validation, we run the “lda� function with the argument “CV=TRUE�:
lda.cv<-lda(martinx.imp[, 3:14], martinx.imp[,2], CV=TRUE)
summary(lda.cv)
Length Class Mode
class 489 factor numeric
posterior 4401 -none- numeric
call 4 -none- call
We make a table of the assignments of the cross-validation (also called a Jackknife test):
cv.jack.table<-table(martinx.imp[,2], lda.cv$class)
cv.jack.table
karlschumannii ledermannii margaritaceus nduru niveus
karlschumannii 9 0 5 0 0
ledermannii 0 9 0 0 0
margaritaceus 9 3 175 5 2
nduru 0 0 5 44 0
niveus 0 0 4 1 0
obtusiflorus 0 0 4 2 2
somaliensis 0 0 0 1 0
sphaerocephalus 0 3 5 3 1
tisserantii 0 0 4 6 0
obtusiflorus somaliensis sphaerocephalus tisserantii
karlschumannii 0 0 0 0
ledermannii 1 0 3 0
margaritaceus 3 0 2 2
nduru 0 0 1 10
niveus 2 0 1 4
obtusiflorus 50 0 10 4
somaliensis 0 3 0 2
sphaerocephalus 5 0 45 2
tisserantii 1 0 0 36
This looks quite successful, apart from niveus, where none of its individuals were successfully allocated. This analysis uses only the quantitative variables. The result for margaritaceus is especially notable, with most individuals assigned to it.
# calculate total errors in cross-validation
# run function "errors.cv"
errors.cv<-function(square.tab) {
errors<-round(sum(square.tab) - sum(diag(square.tab)), 1)
total<-round(sum(square.tab), 0)
errors.prop<-round(errors/total, 4)
result<-c(errors, total, errors.prop)
names(result)<-c("errors", "total", "errors.proportion")
result
}
# Use function "errors.cv" to compute overall errors of cross validation
errors.cv(cv.jack.table)
errors total errors.proportion
118.0000 489.0000 0.2413
# Compute percentage correct assignments for each population
round(diag(cv.jack.table)/rowSums(cv.jack.table),3)*100
karlschumannii ledermannii margaritaceus nduru
64.3 69.2 87.1 73.3
niveus obtusiflorus somaliensis sphaerocephalus
0.0 69.4 50.0 70.3
tisserantii
76.6
# pegar a matriz de probabilidades posteriores de cada individuo e plotar em barras
indiv.probs<-t(lda.cv$posterior)
colnames(indiv.probs)<-martinx.imp$Specimen
barplot(indiv.probs, col=c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), border=NA, space=0, las=2, cex.names=0.4, ylab = "posterior probabilities")
## plot the eigenvectors (= loadings)
par(mfrow=c(1,2))
barplot(sort(lda$scaling[,1]), las=2, cex.names=0.7, cex=0.9, cex.main=0.8, main="discriminant function 1", ylab="pesos")
barplot(sort(lda$scaling[,2]), las=2, cex.names=0.7, cex=0.9, cex.main=0.8, main="discriminant function 2", ylab="pesos")
par(mfrow=c(1,1))
Now we turn this table into proportions and make a bar plot:
proportions<-prop.table(cv.jack.table,2)
proportions
karlschumannii ledermannii margaritaceus nduru
karlschumannii 0.50000000 0.00000000 0.02475248 0.00000000
ledermannii 0.00000000 0.60000000 0.00000000 0.00000000
margaritaceus 0.50000000 0.20000000 0.86633663 0.08064516
nduru 0.00000000 0.00000000 0.02475248 0.70967742
niveus 0.00000000 0.00000000 0.01980198 0.01612903
obtusiflorus 0.00000000 0.00000000 0.01980198 0.03225806
somaliensis 0.00000000 0.00000000 0.00000000 0.01612903
sphaerocephalus 0.00000000 0.20000000 0.02475248 0.04838710
tisserantii 0.00000000 0.00000000 0.01980198 0.09677419
niveus obtusiflorus somaliensis sphaerocephalus
karlschumannii 0.00000000 0.00000000 0.00000000 0.00000000
ledermannii 0.00000000 0.01612903 0.00000000 0.04838710
margaritaceus 0.40000000 0.04838710 0.00000000 0.03225806
nduru 0.00000000 0.00000000 0.00000000 0.01612903
niveus 0.00000000 0.03225806 0.00000000 0.01612903
obtusiflorus 0.40000000 0.80645161 0.00000000 0.16129032
somaliensis 0.00000000 0.00000000 1.00000000 0.00000000
sphaerocephalus 0.20000000 0.08064516 0.00000000 0.72580645
tisserantii 0.00000000 0.01612903 0.00000000 0.00000000
tisserantii
karlschumannii 0.00000000
ledermannii 0.00000000
margaritaceus 0.03333333
nduru 0.16666667
niveus 0.06666667
obtusiflorus 0.06666667
somaliensis 0.03333333
sphaerocephalus 0.03333333
tisserantii 0.60000000
#check the columns sum to 1
apply(prop.table(cv.jack.table,2),2,sum)
karlschumannii ledermannii margaritaceus nduru
1 1 1 1
niveus obtusiflorus somaliensis sphaerocephalus
1 1 1 1
tisserantii
1
Now barplot the proportions table:
par(mar=c(5,4,4,2))
barplot(proportions, main="LDA: assignment of original populations by cross-validation using untransformed data",cex=0.8, cex.lab=0.8, cex.main=0.8, cex.names=0.7, col=c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), ylab="proportion of populations assigned")
legend("topright", title="original", cex=0.7, row.names(proportions),fill=c("red","blue","green","black","magenta","cyan3","sienna", "khaki", "grey"), bg="white", inset=0.045)
This shows the proportions of correctly assigned individuals in the classes produced by cross validation. Each bar is the result of the cross-validation, and the colours are those of the original populations. In this analysis, margaritaceus has a successful result, considering its wide range.
Now add the LDA cross-validation assignments to the data frame martinx.imp as a new column and name the new data frame:
martinx.imp.lda.crossval<-data.frame(martinx.imp, lda.cv$class)
head(martinx.imp.lda.crossval)
NA
Extract the individuals and their LDA assignments
LDA.assignments<-martinx.imp.lda.crossval[,c(1:2,18)]
LDA.assignments
#write this file to Excel csv
write.csv(LDA.assignments, file="LDA assignments to species.csv", row.names = FALSE)
Carry out a Classification Tree analysis
load the package rpart
library(rpart)
Compute and plot the tree
tree1<-rpart(Taxon~ Lflen + Lfwid + Cullen + Culwid + Numbrac + Lenbrac + Numspk + Spklen + Spkwid + Gllen + Glwid + Numgl + Bshpers + Bshtext + Bshglos, data=martinx.imp, method="class")
tree1
n= 489
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 489 288 margaritaceus (0.029 0.027 0.41 0.12 0.025 0.15 0.012 0.13 0.096)
2) Glwid>=1.95 306 113 margaritaceus (0.046 0.0065 0.63 0.12 0.0098 0.039 0 0.11 0.039)
4) Bshtext=firm,papery 261 74 margaritaceus (0.054 0.0077 0.72 0 0.011 0.046 0 0.13 0.034)
8) Numspk< 7.165 217 37 margaritaceus (0.065 0 0.83 0 0.014 0.014 0 0.037 0.041)
16) Cullen>=70.34458 15 5 karlschumannii (0.67 0 0.33 0 0 0 0 0 0) *
17) Cullen< 70.34458 202 27 margaritaceus (0.02 0 0.87 0 0.015 0.015 0 0.04 0.045) *
9) Numspk>=7.165 44 18 sphaerocephalus (0 0.045 0.16 0 0 0.2 0 0.59 0)
18) Spkwid< 5.25 9 4 margaritaceus (0 0 0.56 0 0 0.44 0 0 0) *
19) Spkwid>=5.25 35 9 sphaerocephalus (0 0.057 0.057 0 0 0.14 0 0.74 0) *
5) Bshtext=hard 45 9 nduru (0 0 0.13 0.8 0 0 0 0 0.067)
10) Lflen>=14.25 8 2 margaritaceus (0 0 0.75 0.12 0 0 0 0 0.12) *
11) Lflen< 14.25 37 2 nduru (0 0 0 0.95 0 0 0 0 0.054) *
3) Glwid< 1.95 183 123 obtusiflorus (0 0.06 0.044 0.13 0.049 0.33 0.033 0.16 0.19)
6) Numspk>=11.25 66 19 obtusiflorus (0 0.12 0.045 0 0.03 0.71 0 0.076 0.015)
12) Lfwid>=2.35 10 4 ledermannii (0 0.6 0.2 0 0 0.2 0 0 0) *
13) Lfwid< 2.35 56 11 obtusiflorus (0 0.036 0.018 0 0.036 0.8 0 0.089 0.018) *
7) Numspk< 11.25 117 83 tisserantii (0 0.026 0.043 0.21 0.06 0.11 0.051 0.21 0.29)
14) Lenbrac< 1.325 24 6 nduru (0 0 0 0.75 0 0.042 0.083 0.042 0.083) *
15) Lenbrac>=1.325 93 61 tisserantii (0 0.032 0.054 0.065 0.075 0.13 0.043 0.26 0.34)
30) Spklen>=9.09 41 20 sphaerocephalus (0 0.049 0.073 0.049 0.073 0.17 0 0.51 0.073) *
31) Spklen< 9.09 52 23 tisserantii (0 0.019 0.038 0.077 0.077 0.096 0.077 0.058 0.56) *
plot(tree1)
text(tree1, cex=0.7, xpd=TRUE)
Plot with package rpart.plot
library(rpart.plot)
rpart.plot(tree1)
All boxes will be white (the box.palette argument will be ignored) because
the number of classes in the response 9 is greater than length(box.palette) 6.
To silence this warning use box.palette=0 or trace=-1.
Make a bar chart of the assignments
#check that order of assigned individuals is the same as in the data frame
data.frame(rownames(martinx.imp),tree1$where)
#add assignments to data frame
martinx.imp.rpart.assign<-data.frame(martinx.imp, tree1$where)
head(martinx.imp.rpart.assign)
NA
Find CART assignments within each original species
#make a simplified data frame with just the three factors
d<-martinx.imp.rpart.assign[,c(1:2,18)]
head(d)
#table the species and assignments:
rpart.counts<-table(d$Taxon,d$tree1.where)
rpart.counts
5 6 8 9 11 12 15 16 18 20 21
karlschumannii 10 4 0 0 0 0 0 0 0 0 0
ledermannii 0 0 0 2 0 0 6 2 0 2 1
margaritaceus 5 175 5 2 6 0 2 1 0 3 2
nduru 0 0 0 0 1 35 0 0 18 2 4
niveus 0 3 0 0 0 0 0 2 0 3 4
obtusiflorus 0 3 4 5 0 0 2 45 1 7 5
somaliensis 0 0 0 0 0 0 0 0 2 0 4
sphaerocephalus 0 8 0 26 0 0 0 5 1 21 3
tisserantii 0 9 0 0 1 2 0 1 2 3 29
#check that columns sum to species totals:
data.frame(apply(rpart.counts,1,sum), table(d$Taxon))
NA
Find out which are the species names of the columns; the programme assigns to the column the species with the most individuals assigned to it
assign<-rep(0,11)
for(i in 1:11) {
assign[i]<-names(which.max(rpart.counts[,i]))
}
assign
[1] "karlschumannii" "margaritaceus" "margaritaceus"
[4] "sphaerocephalus" "margaritaceus" "nduru"
[7] "ledermannii" "obtusiflorus" "nduru"
[10] "sphaerocephalus" "tisserantii"
Make the new column names for the counts matrix, avoiding duplication of names
colnames(rpart.counts)<-c("karl.crt","marg1.crt","marg2.crt","sphaer1", "marg3.crt","nduru1.crt","leder.crt","obtus.crt", "nduru2.crt", "sphaer2", "tiss")
rpart.counts
karl.crt marg1.crt marg2.crt sphaer1 marg3.crt
karlschumannii 10 4 0 0 0
ledermannii 0 0 0 2 0
margaritaceus 5 175 5 2 6
nduru 0 0 0 0 1
niveus 0 3 0 0 0
obtusiflorus 0 3 4 5 0
somaliensis 0 0 0 0 0
sphaerocephalus 0 8 0 26 0
tisserantii 0 9 0 0 1
nduru1.crt leder.crt obtus.crt nduru2.crt sphaer2 tiss
karlschumannii 0 0 0 0 0 0
ledermannii 0 6 2 0 2 1
margaritaceus 0 2 1 0 3 2
nduru 35 0 0 18 2 4
niveus 0 0 2 0 3 4
obtusiflorus 0 2 45 1 7 5
somaliensis 0 0 0 2 0 4
sphaerocephalus 0 0 5 1 21 3
tisserantii 2 0 1 2 3 29
Make this table into a table of proportions for each row.
#make table of proportions and table as percentages
proportions<-prop.table(rpart.counts,2)
round(proportions*100, 2)
karl.crt marg1.crt marg2.crt sphaer1 marg3.crt
karlschumannii 66.67 1.98 0.00 0.00 0.00
ledermannii 0.00 0.00 0.00 5.71 0.00
margaritaceus 33.33 86.63 55.56 5.71 75.00
nduru 0.00 0.00 0.00 0.00 12.50
niveus 0.00 1.49 0.00 0.00 0.00
obtusiflorus 0.00 1.49 44.44 14.29 0.00
somaliensis 0.00 0.00 0.00 0.00 0.00
sphaerocephalus 0.00 3.96 0.00 74.29 0.00
tisserantii 0.00 4.46 0.00 0.00 12.50
nduru1.crt leder.crt obtus.crt nduru2.crt sphaer2
karlschumannii 0.00 0.00 0.00 0.00 0.00
ledermannii 0.00 60.00 3.57 0.00 4.88
margaritaceus 0.00 20.00 1.79 0.00 7.32
nduru 94.59 0.00 0.00 75.00 4.88
niveus 0.00 0.00 3.57 0.00 7.32
obtusiflorus 0.00 20.00 80.36 4.17 17.07
somaliensis 0.00 0.00 0.00 8.33 0.00
sphaerocephalus 0.00 0.00 8.93 4.17 51.22
tisserantii 5.41 0.00 1.79 8.33 7.32
tiss
karlschumannii 0.00
ledermannii 1.92
margaritaceus 3.85
nduru 7.69
niveus 7.69
obtusiflorus 9.62
somaliensis 7.69
sphaerocephalus 5.77
tisserantii 55.77
#check the columns sum to 100%:
apply(proportions*100,2,sum)
karl.crt marg1.crt marg2.crt sphaer1 marg3.crt nduru1.crt
100 100 100 100 100 100
leder.crt obtus.crt nduru2.crt sphaer2 tiss
100 100 100 100 100
Barplot proportions of original species individuals assigned to each “species� resulting from CART analysis
barplot(as.matrix(proportions), names.arg=colnames(rpart.counts), cex=0.8, cex.names=0.7,
main="tree1, eleven leaves, with imputed data",
cex.lab=0.8, cex.main=0.8, xlab="CART species",
ylab="proportion of species individuals assigned to each column (crt)",
col=rainbow(length(rpart.counts[,1])))
legend("bottomleft", title="orig. species", cex=0.8, rownames(rpart.counts),
fill=rainbow(9), bg="white",
inset=0.05)
NA
NA
Now show the individual assignments made by CART:
res<-rep(0,489)
ass<-as.character(tree1$where)
cartnames<- colnames(rpart.counts)
for(i in 1:length(tree1$where))if(ass[i]=="5"){res[i]<-cartnames[1]}
for(i in 1:length(tree1$where))if(ass[i]=="6"){res[i]<-cartnames[2]}
for(i in 1:length(tree1$where))if(ass[i]=="8"){res[i]<-cartnames[3]}
for(i in 1:length(tree1$where))if(ass[i]=="9"){res[i]<-cartnames[4]}
for(i in 1:length(tree1$where))if(ass[i]=="11"){res[i]<-cartnames[5]}
for(i in 1:length(tree1$where))if(ass[i]=="12"){res[i]<-cartnames[6]}
for(i in 1:length(tree1$where))if(ass[i]=="15"){res[i]<-cartnames[7]}
for(i in 1:length(tree1$where))if(ass[i]=="16"){res[i]<-cartnames[8]}
for(i in 1:length(tree1$where))if(ass[i]=="18"){res[i]<-cartnames[9]}
for(i in 1:length(tree1$where))if(ass[i]=="20"){res[i]<-cartnames[10]}
for(i in 1:length(tree1$where))if(ass[i]=="21"){res[i]<-cartnames[11]}
res
[1] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[6] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[11] "marg1.crt" "sphaer1" "marg1.crt" "marg1.crt" "marg1.crt"
[16] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[21] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[26] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[31] "sphaer1" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[36] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[41] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[46] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[51] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[56] "marg2.crt" "marg1.crt" "marg1.crt" "marg1.crt" "karl.crt"
[61] "marg1.crt" "karl.crt" "marg1.crt" "marg1.crt" "marg2.crt"
[66] "leder.crt" "sphaer2" "marg1.crt" "marg1.crt" "marg1.crt"
[71] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[76] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "leder.crt"
[81] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[86] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[91] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[96] "karl.crt" "marg1.crt" "karl.crt" "marg1.crt" "marg1.crt"
[101] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[106] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[111] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[116] "marg1.crt" "marg1.crt" "marg1.crt" "karl.crt" "marg1.crt"
[121] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[126] "marg1.crt" "marg3.crt" "marg1.crt" "marg3.crt" "marg1.crt"
[131] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[136] "marg3.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[141] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg2.crt"
[146] "marg1.crt" "marg1.crt" "marg2.crt" "marg1.crt" "marg1.crt"
[151] "marg1.crt" "marg1.crt" "marg1.crt" "marg3.crt" "marg3.crt"
[156] "marg1.crt" "marg3.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[161] "marg1.crt" "marg1.crt" "marg1.crt" "sphaer2" "marg1.crt"
[166] "marg1.crt" "marg1.crt" "obtus.crt" "marg2.crt" "marg1.crt"
[171] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[176] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[181] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[186] "tiss" "marg1.crt" "marg1.crt" "tiss" "marg1.crt"
[191] "marg1.crt" "marg1.crt" "marg1.crt" "sphaer2" "marg1.crt"
[196] "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt" "marg1.crt"
[201] "marg1.crt" "marg1.crt" "karl.crt" "karl.crt" "marg1.crt"
[206] "karl.crt" "karl.crt" "marg1.crt" "karl.crt" "karl.crt"
[211] "marg1.crt" "karl.crt" "karl.crt" "karl.crt" "karl.crt"
[216] "sphaer2" "sphaer1" "obtus.crt" "leder.crt" "tiss"
[221] "sphaer2" "sphaer1" "leder.crt" "obtus.crt" "leder.crt"
[226] "leder.crt" "leder.crt" "leder.crt" "tiss" "tiss"
[231] "tiss" "tiss" "tiss" "tiss" "marg1.crt"
[236] "marg1.crt" "tiss" "tiss" "marg1.crt" "tiss"
[241] "tiss" "sphaer2" "tiss" "tiss" "tiss"
[246] "nduru1.crt" "tiss" "tiss" "tiss" "marg3.crt"
[251] "tiss" "tiss" "marg1.crt" "tiss" "obtus.crt"
[256] "tiss" "marg1.crt" "sphaer2" "tiss" "tiss"
[261] "tiss" "tiss" "tiss" "marg1.crt" "marg1.crt"
[266] "tiss" "nduru2.crt" "marg1.crt" "tiss" "nduru2.crt"
[271] "tiss" "nduru1.crt" "sphaer2" "marg1.crt" "tiss"
[276] "tiss" "tiss" "marg1.crt" "tiss" "obtus.crt"
[281] "tiss" "marg1.crt" "sphaer2" "sphaer2" "marg1.crt"
[286] "sphaer2" "obtus.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt"
[291] "nduru1.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt"
[296] "nduru1.crt" "nduru1.crt" "tiss" "nduru1.crt" "nduru1.crt"
[301] "sphaer2" "nduru1.crt" "nduru1.crt" "nduru1.crt" "tiss"
[306] "nduru1.crt" "nduru1.crt" "nduru1.crt" "nduru2.crt" "nduru1.crt"
[311] "nduru2.crt" "nduru2.crt" "nduru2.crt" "nduru1.crt" "nduru1.crt"
[316] "nduru2.crt" "nduru2.crt" "nduru2.crt" "nduru1.crt" "nduru1.crt"
[321] "tiss" "nduru2.crt" "nduru2.crt" "nduru1.crt" "nduru1.crt"
[326] "nduru1.crt" "nduru2.crt" "sphaer2" "nduru2.crt" "nduru2.crt"
[331] "nduru1.crt" "nduru1.crt" "nduru2.crt" "nduru1.crt" "nduru2.crt"
[336] "nduru2.crt" "tiss" "nduru2.crt" "nduru2.crt" "nduru1.crt"
[341] "nduru1.crt" "nduru1.crt" "nduru2.crt" "nduru1.crt" "nduru1.crt"
[346] "marg3.crt" "nduru1.crt" "leder.crt" "obtus.crt" "obtus.crt"
[351] "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt"
[356] "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt"
[361] "obtus.crt" "obtus.crt" "tiss" "obtus.crt" "nduru2.crt"
[366] "obtus.crt" "marg2.crt" "tiss" "obtus.crt" "obtus.crt"
[371] "tiss" "sphaer2" "obtus.crt" "obtus.crt" "sphaer2"
[376] "obtus.crt" "tiss" "obtus.crt" "obtus.crt" "obtus.crt"
[381] "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt"
[386] "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt"
[391] "marg1.crt" "sphaer1" "marg1.crt" "sphaer2" "sphaer2"
[396] "sphaer2" "sphaer1" "sphaer1" "sphaer1" "sphaer2"
[401] "obtus.crt" "marg2.crt" "obtus.crt" "marg2.crt" "marg2.crt"
[406] "marg1.crt" "obtus.crt" "obtus.crt" "obtus.crt" "sphaer1"
[411] "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt" "obtus.crt"
[416] "sphaer2" "obtus.crt" "leder.crt" "tiss" "sphaer2"
[421] "sphaer2" "sphaer2" "sphaer2" "sphaer2" "obtus.crt"
[426] "sphaer2" "sphaer1" "sphaer1" "sphaer2" "sphaer2"
[431] "sphaer2" "sphaer2" "sphaer2" "sphaer2" "sphaer2"
[436] "sphaer2" "sphaer2" "sphaer2" "sphaer2" "marg1.crt"
[441] "sphaer2" "sphaer1" "marg1.crt" "sphaer1" "tiss"
[446] "sphaer1" "sphaer1" "obtus.crt" "sphaer1" "sphaer1"
[451] "obtus.crt" "sphaer1" "marg1.crt" "sphaer1" "sphaer1"
[456] "obtus.crt" "sphaer1" "sphaer1" "tiss" "sphaer1"
[461] "sphaer1" "obtus.crt" "sphaer2" "sphaer2" "sphaer1"
[466] "marg1.crt" "sphaer1" "marg1.crt" "marg1.crt" "marg1.crt"
[471] "sphaer1" "sphaer2" "marg1.crt" "sphaer1" "sphaer1"
[476] "sphaer1" "sphaer1" "sphaer1" "sphaer1" "sphaer1"
[481] "sphaer1" "tiss" "nduru2.crt" "nduru2.crt" "tiss"
[486] "tiss" "tiss" "nduru2.crt" "tiss"
Put assignments in the righthand columns
d.names<-data.frame(d,res)
d.names
#write this file to Excel csv
write.csv(d.names, file="CART assignments to species.csv", row.names = FALSE)
REPEAT THE CART ANALYSIS WITH NON-DEFAULT SETTINGS FOR MINSPLIT AND MINBUCKET
# Compute the tree with minsplit=10 (minimum number of individuals in a node to permit further splitting) and minbucket=5 (minimum number of individuals in a terminal node).
tree3<-rpart(Taxon~ Lflen + Lfwid + Cullen + Culwid + Numbrac + Lenbrac + Numspk + Spklen + Spkwid + Gllen + Glwid + Numgl + Bshpers + Bshtext + Bshglos, data=martinx.imp, method="class", control= rpart.control(minsplit = 10, minbucket = 9))
plot(tree3)
text(tree3, cex=0.7, xpd=TRUE)
printcp(tree3)
Classification tree:
rpart(formula = Taxon ~ Lflen + Lfwid + Cullen + Culwid + Numbrac +
Lenbrac + Numspk + Spklen + Spkwid + Gllen + Glwid + Numgl +
Bshpers + Bshtext + Bshglos, data = martinx.imp, method = "class",
control = rpart.control(minsplit = 10, minbucket = 9))
Variables actually used in tree construction:
[1] Bshtext Cullen Glwid Lenbrac Lflen Lfwid Numspk Spklen
[9] Spkwid
Root node error: 288/489 = 0.58896
n= 489
CP nsplit rel error xerror xstd
1 0.180556 0 1.00000 1.00000 0.037779
2 0.104167 1 0.81944 0.83681 0.038387
3 0.072917 2 0.71528 0.74306 0.038091
4 0.065972 3 0.64236 0.71875 0.037937
5 0.059028 4 0.57639 0.68750 0.037691
6 0.017361 6 0.45833 0.53125 0.035602
7 0.013889 8 0.42361 0.54167 0.035788
8 0.010000 10 0.39583 0.55208 0.035967
plotcp(tree3)
tree4<- prune(tree3, cp=0.032)
plot(tree4)
text(tree4, cex=0.7, xpd=TRUE)
print(tree4)
n= 489
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 489 288 margaritaceus (0.029 0.027 0.41 0.12 0.025 0.15 0.012 0.13 0.096)
2) Glwid>=1.95 306 113 margaritaceus (0.046 0.0065 0.63 0.12 0.0098 0.039 0 0.11 0.039)
4) Bshtext=firm,papery 261 74 margaritaceus (0.054 0.0077 0.72 0 0.011 0.046 0 0.13 0.034)
8) Numspk< 7.165 217 37 margaritaceus (0.065 0 0.83 0 0.014 0.014 0 0.037 0.041) *
9) Numspk>=7.165 44 18 sphaerocephalus (0 0.045 0.16 0 0 0.2 0 0.59 0) *
5) Bshtext=hard 45 9 nduru (0 0 0.13 0.8 0 0 0 0 0.067) *
3) Glwid< 1.95 183 123 obtusiflorus (0 0.06 0.044 0.13 0.049 0.33 0.033 0.16 0.19)
6) Numspk>=11.25 66 19 obtusiflorus (0 0.12 0.045 0 0.03 0.71 0 0.076 0.015) *
7) Numspk< 11.25 117 83 tisserantii (0 0.026 0.043 0.21 0.06 0.11 0.051 0.21 0.29)
14) Lenbrac< 1.325 24 6 nduru (0 0 0 0.75 0 0.042 0.083 0.042 0.083) *
15) Lenbrac>=1.325 93 61 tisserantii (0 0.032 0.054 0.065 0.075 0.13 0.043 0.26 0.34)
30) Spklen>=9.09 41 20 sphaerocephalus (0 0.049 0.073 0.049 0.073 0.17 0 0.51 0.073) *
31) Spklen< 9.09 52 23 tisserantii (0 0.019 0.038 0.077 0.077 0.096 0.077 0.058 0.56) *
tree5<- prune(tree3, cp=0.016)
plot(tree5)
text(tree5, cex=0.7, xpd=TRUE)
print(tree5)
n= 489
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 489 288 margaritaceus (0.029 0.027 0.41 0.12 0.025 0.15 0.012 0.13 0.096)
2) Glwid>=1.95 306 113 margaritaceus (0.046 0.0065 0.63 0.12 0.0098 0.039 0 0.11 0.039)
4) Bshtext=firm,papery 261 74 margaritaceus (0.054 0.0077 0.72 0 0.011 0.046 0 0.13 0.034)
8) Numspk< 7.165 217 37 margaritaceus (0.065 0 0.83 0 0.014 0.014 0 0.037 0.041)
16) Cullen>=70.34458 15 5 karlschumannii (0.67 0 0.33 0 0 0 0 0 0) *
17) Cullen< 70.34458 202 27 margaritaceus (0.02 0 0.87 0 0.015 0.015 0 0.04 0.045) *
9) Numspk>=7.165 44 18 sphaerocephalus (0 0.045 0.16 0 0 0.2 0 0.59 0)
18) Spkwid< 5.25 9 4 margaritaceus (0 0 0.56 0 0 0.44 0 0 0) *
19) Spkwid>=5.25 35 9 sphaerocephalus (0 0.057 0.057 0 0 0.14 0 0.74 0) *
5) Bshtext=hard 45 9 nduru (0 0 0.13 0.8 0 0 0 0 0.067) *
3) Glwid< 1.95 183 123 obtusiflorus (0 0.06 0.044 0.13 0.049 0.33 0.033 0.16 0.19)
6) Numspk>=11.25 66 19 obtusiflorus (0 0.12 0.045 0 0.03 0.71 0 0.076 0.015) *
7) Numspk< 11.25 117 83 tisserantii (0 0.026 0.043 0.21 0.06 0.11 0.051 0.21 0.29)
14) Lenbrac< 1.325 24 6 nduru (0 0 0 0.75 0 0.042 0.083 0.042 0.083) *
15) Lenbrac>=1.325 93 61 tisserantii (0 0.032 0.054 0.065 0.075 0.13 0.043 0.26 0.34)
30) Spklen>=9.09 41 20 sphaerocephalus (0 0.049 0.073 0.049 0.073 0.17 0 0.51 0.073) *
31) Spklen< 9.09 52 23 tisserantii (0 0.019 0.038 0.077 0.077 0.096 0.077 0.058 0.56) *
# Adjust the rpart.plot arguments *type*, *tweak* and *extra* to give a more useful and informative display of the tree
rpart.plot(tree4, type=0, tweak=0.9, extra=2+100)
rpart.plot(tree5, type=0, tweak=0.9, extra=2+100)
Make a bar chart of the assignments
#check that order of assigned individuals is the same as in the data frame
data.frame(rownames(martinx.imp),tree4$where)
#add assignments to data frame
martinx.imp.rpart.assign.tree4<-data.frame(martinx.imp, tree4$where)
head(martinx.imp.rpart.assign.tree4)
NA
Find CART assignments within each original species
#make a simplified data frame with just the three factors
dtree4<-martinx.imp.rpart.assign.tree4[,c(1:2,18)]
head(dtree4)
#table the species and assignments:
rpart.counts.tree4<-table(dtree4$Taxon,dtree4$tree4.where)
rpart.counts.tree4
4 5 6 8 10 12 13
karlschumannii 14 0 0 0 0 0 0
ledermannii 0 2 0 8 0 2 1
margaritaceus 180 7 6 3 0 3 2
nduru 0 0 36 0 18 2 4
niveus 3 0 0 2 0 3 4
obtusiflorus 3 9 0 47 1 7 5
somaliensis 0 0 0 0 2 0 4
sphaerocephalus 8 26 0 5 1 21 3
tisserantii 9 0 3 1 2 3 29
#check that rows sum to species totals:
data.frame(apply(rpart.counts.tree4,1,sum), table(dtree4$Taxon))
NA
Find out which are the species names of the columns; the programme assigns to the column the species with the most individuals assigned to it
assign.tree4<-rep(0,7)
for(i in 1:7) {
assign.tree4[i]<-names(which.max(rpart.counts.tree4[,i]))
}
assign.tree4
[1] "margaritaceus" "sphaerocephalus" "nduru"
[4] "obtusiflorus" "nduru" "sphaerocephalus"
[7] "tisserantii"
Make the new column names for the counts matrix, avoiding duplication of names
colnames(rpart.counts.tree4)<-c("marg.crt","sphaer1.crt","nduru1.crt","obtusi.crt","nduru2.crt","sphaer2.crt","tisser.crt")
rpart.counts.tree4
marg.crt sphaer1.crt nduru1.crt obtusi.crt nduru2.crt
karlschumannii 14 0 0 0 0
ledermannii 0 2 0 8 0
margaritaceus 180 7 6 3 0
nduru 0 0 36 0 18
niveus 3 0 0 2 0
obtusiflorus 3 9 0 47 1
somaliensis 0 0 0 0 2
sphaerocephalus 8 26 0 5 1
tisserantii 9 0 3 1 2
sphaer2.crt tisser.crt
karlschumannii 0 0
ledermannii 2 1
margaritaceus 3 2
nduru 2 4
niveus 3 4
obtusiflorus 7 5
somaliensis 0 4
sphaerocephalus 21 3
tisserantii 3 29
Make this table into a table of proportions for each row.
#make table of proportions and table as percentages
proportions<-prop.table(rpart.counts.tree4,2)
round(proportions*100, 2)
marg.crt sphaer1.crt nduru1.crt obtusi.crt nduru2.crt
karlschumannii 6.45 0.00 0.00 0.00 0.00
ledermannii 0.00 4.55 0.00 12.12 0.00
margaritaceus 82.95 15.91 13.33 4.55 0.00
nduru 0.00 0.00 80.00 0.00 75.00
niveus 1.38 0.00 0.00 3.03 0.00
obtusiflorus 1.38 20.45 0.00 71.21 4.17
somaliensis 0.00 0.00 0.00 0.00 8.33
sphaerocephalus 3.69 59.09 0.00 7.58 4.17
tisserantii 4.15 0.00 6.67 1.52 8.33
sphaer2.crt tisser.crt
karlschumannii 0.00 0.00
ledermannii 4.88 1.92
margaritaceus 7.32 3.85
nduru 4.88 7.69
niveus 7.32 7.69
obtusiflorus 17.07 9.62
somaliensis 0.00 7.69
sphaerocephalus 51.22 5.77
tisserantii 7.32 55.77
#check the columns sum to 100%:
apply(proportions*100,2,sum)
marg.crt sphaer1.crt nduru1.crt obtusi.crt nduru2.crt sphaer2.crt
100 100 100 100 100 100
tisser.crt
100
barplot proportions of original species individuals assigned to each “species� resulting from CART analysis
barplot(as.matrix(proportions), names.arg=colnames(rpart.counts.tree4), cex=0.8, cex.names=0.7,
main="tree4, seven leaves, with imputed data",
cex.lab=0.8, cex.main=0.8, xlab="CART species",
ylab="proportion of species individuals assigned to each column (crt)",
col=rainbow(length(rpart.counts.tree4[,1])))
legend("bottomleft", title="orig. species", cex=0.8, rownames(rpart.counts.tree4),
fill=rainbow(7), bg="white",
inset=0.05)
Now show the individual assignments made by CART:
res<-rep(0,489)
ass<-as.character(tree4$where)
cartnames<- colnames(rpart.counts.tree4)
for(i in 1:length(tree1$where))if(ass[i]=="4"){res[i]<-cartnames[1]}
for(i in 1:length(tree1$where))if(ass[i]=="5"){res[i]<-cartnames[2]}
for(i in 1:length(tree1$where))if(ass[i]=="6"){res[i]<-cartnames[3]}
for(i in 1:length(tree1$where))if(ass[i]=="8"){res[i]<-cartnames[4]}
for(i in 1:length(tree1$where))if(ass[i]=="10"){res[i]<-cartnames[5]}
for(i in 1:length(tree1$where))if(ass[i]=="12"){res[i]<-cartnames[6]}
for(i in 1:length(tree1$where))if(ass[i]=="13"){res[i]<-cartnames[7]}
res
[1] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[5] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[9] "marg.crt" "marg.crt" "marg.crt" "sphaer1.crt"
[13] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[17] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[21] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[25] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[29] "marg.crt" "marg.crt" "sphaer1.crt" "marg.crt"
[33] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[37] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[41] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[45] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[49] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[53] "marg.crt" "marg.crt" "marg.crt" "sphaer1.crt"
[57] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[61] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[65] "sphaer1.crt" "obtusi.crt" "sphaer2.crt" "marg.crt"
[69] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[73] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[77] "marg.crt" "marg.crt" "marg.crt" "obtusi.crt"
[81] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[85] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[89] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[93] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[97] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[101] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[105] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[109] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[113] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[117] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[121] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[125] "marg.crt" "marg.crt" "nduru1.crt" "marg.crt"
[129] "nduru1.crt" "marg.crt" "marg.crt" "marg.crt"
[133] "marg.crt" "marg.crt" "marg.crt" "nduru1.crt"
[137] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[141] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[145] "sphaer1.crt" "marg.crt" "marg.crt" "sphaer1.crt"
[149] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[153] "marg.crt" "nduru1.crt" "nduru1.crt" "marg.crt"
[157] "nduru1.crt" "marg.crt" "marg.crt" "marg.crt"
[161] "marg.crt" "marg.crt" "marg.crt" "sphaer2.crt"
[165] "marg.crt" "marg.crt" "marg.crt" "obtusi.crt"
[169] "sphaer1.crt" "marg.crt" "marg.crt" "marg.crt"
[173] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[177] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[181] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[185] "marg.crt" "tisser.crt" "marg.crt" "marg.crt"
[189] "tisser.crt" "marg.crt" "marg.crt" "marg.crt"
[193] "marg.crt" "sphaer2.crt" "marg.crt" "marg.crt"
[197] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[201] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[205] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[209] "marg.crt" "marg.crt" "marg.crt" "marg.crt"
[213] "marg.crt" "marg.crt" "marg.crt" "sphaer2.crt"
[217] "sphaer1.crt" "obtusi.crt" "obtusi.crt" "tisser.crt"
[221] "sphaer2.crt" "sphaer1.crt" "obtusi.crt" "obtusi.crt"
[225] "obtusi.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[229] "tisser.crt" "tisser.crt" "tisser.crt" "tisser.crt"
[233] "tisser.crt" "tisser.crt" "marg.crt" "marg.crt"
[237] "tisser.crt" "tisser.crt" "marg.crt" "tisser.crt"
[241] "tisser.crt" "sphaer2.crt" "tisser.crt" "tisser.crt"
[245] "tisser.crt" "nduru1.crt" "tisser.crt" "tisser.crt"
[249] "tisser.crt" "nduru1.crt" "tisser.crt" "tisser.crt"
[253] "marg.crt" "tisser.crt" "obtusi.crt" "tisser.crt"
[257] "marg.crt" "sphaer2.crt" "tisser.crt" "tisser.crt"
[261] "tisser.crt" "tisser.crt" "tisser.crt" "marg.crt"
[265] "marg.crt" "tisser.crt" "nduru2.crt" "marg.crt"
[269] "tisser.crt" "nduru2.crt" "tisser.crt" "nduru1.crt"
[273] "sphaer2.crt" "marg.crt" "tisser.crt" "tisser.crt"
[277] "tisser.crt" "marg.crt" "tisser.crt" "obtusi.crt"
[281] "tisser.crt" "marg.crt" "sphaer2.crt" "sphaer2.crt"
[285] "marg.crt" "sphaer2.crt" "obtusi.crt" "nduru1.crt"
[289] "nduru1.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt"
[293] "nduru1.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt"
[297] "nduru1.crt" "tisser.crt" "nduru1.crt" "nduru1.crt"
[301] "sphaer2.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt"
[305] "tisser.crt" "nduru1.crt" "nduru1.crt" "nduru1.crt"
[309] "nduru2.crt" "nduru1.crt" "nduru2.crt" "nduru2.crt"
[313] "nduru2.crt" "nduru1.crt" "nduru1.crt" "nduru2.crt"
[317] "nduru2.crt" "nduru2.crt" "nduru1.crt" "nduru1.crt"
[321] "tisser.crt" "nduru2.crt" "nduru2.crt" "nduru1.crt"
[325] "nduru1.crt" "nduru1.crt" "nduru2.crt" "sphaer2.crt"
[329] "nduru2.crt" "nduru2.crt" "nduru1.crt" "nduru1.crt"
[333] "nduru2.crt" "nduru1.crt" "nduru2.crt" "nduru2.crt"
[337] "tisser.crt" "nduru2.crt" "nduru2.crt" "nduru1.crt"
[341] "nduru1.crt" "nduru1.crt" "nduru2.crt" "nduru1.crt"
[345] "nduru1.crt" "nduru1.crt" "nduru1.crt" "obtusi.crt"
[349] "obtusi.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[353] "obtusi.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[357] "obtusi.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[361] "obtusi.crt" "obtusi.crt" "tisser.crt" "obtusi.crt"
[365] "nduru2.crt" "obtusi.crt" "sphaer1.crt" "tisser.crt"
[369] "obtusi.crt" "obtusi.crt" "tisser.crt" "sphaer2.crt"
[373] "obtusi.crt" "obtusi.crt" "sphaer2.crt" "obtusi.crt"
[377] "tisser.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[381] "obtusi.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[385] "obtusi.crt" "obtusi.crt" "obtusi.crt" "obtusi.crt"
[389] "obtusi.crt" "obtusi.crt" "marg.crt" "sphaer1.crt"
[393] "marg.crt" "sphaer2.crt" "sphaer2.crt" "sphaer2.crt"
[397] "sphaer1.crt" "sphaer1.crt" "sphaer1.crt" "sphaer2.crt"
[401] "obtusi.crt" "sphaer1.crt" "obtusi.crt" "sphaer1.crt"
[405] "sphaer1.crt" "marg.crt" "obtusi.crt" "obtusi.crt"
[409] "obtusi.crt" "sphaer1.crt" "obtusi.crt" "obtusi.crt"
[413] "obtusi.crt" "obtusi.crt" "obtusi.crt" "sphaer2.crt"
[417] "obtusi.crt" "obtusi.crt" "tisser.crt" "sphaer2.crt"
[421] "sphaer2.crt" "sphaer2.crt" "sphaer2.crt" "sphaer2.crt"
[425] "obtusi.crt" "sphaer2.crt" "sphaer1.crt" "sphaer1.crt"
[429] "sphaer2.crt" "sphaer2.crt" "sphaer2.crt" "sphaer2.crt"
[433] "sphaer2.crt" "sphaer2.crt" "sphaer2.crt" "sphaer2.crt"
[437] "sphaer2.crt" "sphaer2.crt" "sphaer2.crt" "marg.crt"
[441] "sphaer2.crt" "sphaer1.crt" "marg.crt" "sphaer1.crt"
[445] "tisser.crt" "sphaer1.crt" "sphaer1.crt" "obtusi.crt"
[449] "sphaer1.crt" "sphaer1.crt" "obtusi.crt" "sphaer1.crt"
[453] "marg.crt" "sphaer1.crt" "sphaer1.crt" "obtusi.crt"
[457] "sphaer1.crt" "sphaer1.crt" "tisser.crt" "sphaer1.crt"
[461] "sphaer1.crt" "obtusi.crt" "sphaer2.crt" "sphaer2.crt"
[465] "sphaer1.crt" "marg.crt" "sphaer1.crt" "marg.crt"
[469] "marg.crt" "marg.crt" "sphaer1.crt" "sphaer2.crt"
[473] "marg.crt" "sphaer1.crt" "sphaer1.crt" "sphaer1.crt"
[477] "sphaer1.crt" "sphaer1.crt" "sphaer1.crt" "sphaer1.crt"
[481] "sphaer1.crt" "tisser.crt" "nduru2.crt" "nduru2.crt"
[485] "tisser.crt" "tisser.crt" "tisser.crt" "nduru2.crt"
[489] "tisser.crt"
Put assignments in the righthand columns
d.names.tree4<-data.frame(dtree4,res)
d.names.tree4
#write this file to Excel csv
write.csv(d.names.tree4, file="CART assignments to species.csv tree4", row.names = FALSE)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.(Simon’s keyboard customized to Ctrl+Shift+I)
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).