For a series of studies on the ecosystem service values of chaparral in Southern California, we developed a raster data layer providing an ecological unit classification of the Southern California landscape. This raster dataset is at a 30 meter pixel resolution and partitions the landscape into 37 different ecological unit types. This dataset was derived through a GIS-based cluster analysis of 10 different physiographic variables, namely soil suborder type, terrain geomorphon type, flow accumulation, slope, solar irradiation, annual precipitation, annual minimum temperature, actual evapotranspiration, and climatic water deficit. This partitioning was based on physiographic variables rather than vegetation types because of the wish to have the ecological units reflect biophysical characteristics rather than the historical land use patterns that may influence vegetation. The cluster analysis was performed across a set of 10,000 points randomly placed on a GIS layer stack for the 10 variables. These random points were grouped into 37 discrete clusters using an algorithm called partitioning around medoids. This assignment of points to clusters was then used to train a random forest classifier, which in turn was run across the GIS stack to produce the output raster layer.

This dataset is described in the following book chapter publication:

Underwood, Emma C., Allan D. Hollander, Patrick R. Huber, and Charlie Schrader-Patton. 2018. “Mapping the Value of National Forest Landscapes for Ecosystem Service Provision.” In Valuing Chaparral, 245–70. Springer Series on Environmental Management. Springer, Cham. https://doi.org/10.1007/978-3-319-68303-4_9.

Summary of Methods for Developing Ecological Units in Southern California

Allan Hollander and Emma Underwood, University of California Davis.

1) Compiling GIS layers. These data were compiled from a variety of sources and resolutions (Table 1) for the southern California study area (see Methods_figure_1.png for the study area). The original resolution of these raster layers ran from 10 meters to 270 meters, and resampling was conducted so all analyses were performed at a 30 meter raster resolution. We decided not to include vegetation in the data stack as the aim was to capture biophysical characteristics and vegetation will reflect current landscape history and land use patterns (e.g. fire history, type conversion from shrubland, or agricultural use). Lakes and reservoirs were omitted from the subsequent analysis. Data compiled:

a) Soil suborders. This was a discretely-classified raster layer with 22 soil suborder classes included in the southern California region. This was derived from the gridded Soil Survey Geographic Database (gSSURGO, available at http://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/geo/?cid=nrcs142p2_053628). This product is a rasterization at a 10-meter resolution of the county-scale SSURGO data published by the USDA Natural Resources Conservation Service.

b) Terrain geomorphons. This raster layer derives from a DEM surface and classifies the landscape into 10 discrete landform types, examples being ridges, slopes, hollows, and valleys. The algorithm for geomorphon classification uses a pattern recognition approach based on line of sight analysis (Jasiewisc and Stepinski 2013). This layer was created from a 30 meter DEM in GRASS 7.0.0, using the extension r.geomorphon (https://grass.osgeo.org/grass70/manuals/addons/r.geomorphon.html).

c) Annualized solar irradiation. This layer uses the r.sun model available in GRASS 7.0.0 (https://grass.osgeo.org/grass70/manuals/r.sun.html) which calculates direct, diffuse, and reflected solar irradiation for a given day, location, topography, and atmospheric conditions. This layer was created from a 30 meter DEM and assumes clear-sky conditions. To estimate the total annual irradiation, the model was run for every 15th day and these values were integrated over the year.

d) Flow accumulation. This layer is another product of 30 meter DEM data and measures the upslope area in pixel count that conceivably drains into a given pixel. This was calculated using the accumulation option in the GRASS 7.0.0 command r.watershed (https://grass.osgeo.org/grass70/manuals/r.watershed.html)

e) Slope. This was derived from 30 meter DEM data using the GRASS 7.0.0 command r.slope.aspect, and is measured in degrees.

f) Annual precipitation. This layer came from the 2014 Basin Characterization Model (BCM) for California (Flint et al. 2013) and gives the average annual precipitation between 1981 and 2010 at a 270-meter resolution.

g) Annual minimum temperature. This layer also came from BCM (Flint et al. 2013) and gives the average annual minimum temperature between 1981 and 2010 at a 270-meter resolution. Minimum temperature was included in the set of climate variables to represent montane winter conditions.

h) Climatic water deficit. This layer also came from the BCM (Flint et al. 2013) and gives the average climatic water deficit between 1981 and 2010 at a 270-meter resolution. The two evapotranspiration variables (climatic water deficit and actual evapotranspiration) are included in this set because they are strong drivers of vegetation distribution (Stephenson 1998).

i) Actual evapotranspiration. This layer also came from the BCM (Flint et al. 2013) and gives the average actual evapotranspiration between 1981 and 2010 at a 270-meter resolution.

Table 1. Summary of GIS data stack

LAYER	ORIGINAL SOURCE	ORIGINAL RESOLUTION	THEME
Soil suborders	gSSURGO	10 meters	Soil type
Terrain geomorphons	Digital elevation model	30 meters	Geomorphometry
Solar irradiation	Digital elevation model	30 meters	Energy balance
Flow accumulation	Digital elevation model	30 meters	Geomorphometry
Slope	Digital elevation model	30 meters	Geomorphometry
Annual precipitation	Basin Characterization Model	270 meters	Climate
Annual min temperature	Basin Characterization Model	270 meters	Climate
Climatic water deficit	Basin Characterization Model	270 meters	Climate
Actual evapotranspiration	Basin Characterization Model	270 meters	Climate

2) Generating 10,000 random points. A mask was imposed to limit analyses to the 35,158 square study area and 10,000 random points were generated to create a data table of the values of each GIS layer at each of the random points. This data table was the basis for sorting the random points into a limited number of clustered types. The first step in doing this is calculating in multivariate space the distance with respect to these environmental variables each random point is from every other point, in other words creating a dissimilarity matrix.

3) Assigning weights to variables. Because the 9 environmental variables use completely different metrics and are a combination of numerical and categorical types, calculating an environmental distance between any two of these random points requires some weighting to be assigned to each of the environmental variables to sum up their relative distances. A subanalysis to determine these weightings used a subset of the study area, the Santa Clara River watershed. Since these ecological units are intended to summarize a diverse set of ecological services, we chose three different proxy variables from the GIS data available for this area to represent biomass, hydrological response, and biodiversity. These proxies included mean annual MODIS Enhanced Vegetation Index (EVI) value for biomass, recharge for hydrological response, and habitat type in the California Wildlife Habitat Relations (CWHR) classification for biodiversity.

The MODIS EVI data was derived by averaging over the 2000-2014 period the maximum EVI value in a single year. The MODIS index used was MOD13Q1 (https://lpdaac.usgs.gov/dataset_discovery/modis/modis_products_table/mod13q1) at a 250 meter resolution, available at 16-day intervals.
The hydrological recharge data were extracted from the 2014 Basin Characterization Model (Flint et al. 2013) at 270 meter resolution.
The CWHR habitat type came from the 2015 FRAP vegetation layer (FVEG15_1, from http://frap.fire.ca.gov/data/frapgisdata-sw-fveg_download), available at a 30 meter resolution.

a) We used random forest regression and classification (Hastie et al. 2009) to determine a ranking of importance values of these predictor variables using random forest regression for EVI and recharge and random forest classification for the habitat type. These were calculated using the randomForest package in R (Liaw and Wiener 2002).

b) We then averaged these three sets of importance values to create an overall set of weightings to enter into the dissimilarity matrix (Table 2).

Table 2. Weightings for each variable to reflect their relative importance to the ecological units

VARIABLE NAME	WEIGHT
Precipitation	1.00
Annual minimum temperature	0.600
Slope	0.507
Climatic water deficit	0.413
Annualized solar radiation	0.404
Soil suborder type	0.374
Actual evapotranspiration	0.367
Flow accumulation	0.355
Geomorphon type	0.226

4) Creating a dissimilarity matrix. We applied the weightings of the subanalysis to create a dissimilarity matrix for the10,000 random points in the southern California study area. This was calculated using the daisy function in the R package cluster (Maechler et al. 2015). The “gower” metric was used in the daisy function: this distance metric (Gower 1971) allows one to combine numerical and categorical variables in a single measure.

5) Clustering random points. To group these random points into discrete cluster types we used a clustering technique called Partitioning Around Medoids, or PAM (Kaufman and Rousseeuw 2005, Hollander 2012). This is a partitional cluster algorithm, where the set of clusters is established all at once, usually requiring assignment of the desired number of clusters beforehand. The PAM technique is similar to the more commonly used k-means algorithm but differs from the latter in that a) cluster centroids are assigned to actual data observations rather than means of the data values and b) the PAM algorithm accepts categorical as well as numerical data. We used the pam function in the R cluster package to carry out this clustering.

6) Assessing performance of clusters. Because the PAM algorithm calls for requesting a particular number of clusters ahead of time, we need to establish the optimum number of clusters to portray the landscape variability. An approach for doing this is to calculate sets of provisional clusters over a range of requested numbers of clusters, and to evaluate the performance of each of the sets using a combination of internal and external homogeneity metrics. We created a collection of cluster sets ranging from 23 to 50 clusters in each.

a) The measure used for internal homogeneity was the average silhouette width of the clustering (Rousseeuw 1987). This value, also accompanied by a graphical display called a silhouette plot, helps illustrate which observation points lie near the center of each respective cluster, and which observation points fall in between two or more clusters. Averaging this measure of cluster validity across every observation point gives the average silhouette width of the cluster analysis: a higher value meaning a better partitioned cluster set.

b) The metric used for external homogeneity was a measure of within-cluster variability for the three proxy variables. When this number is minimized, the clusters were best separated with respect to that particular variable. To take an example, for a given cluster set (say one with 40 clusters in it), the standard deviation of the MODIS EVI values was taken for the observation points grouped by cluster assignment. The mean value of these standard deviations was then calculated to give an overall value of within-cluster variability for that particular cluster set. These overall values were then compared across the collection of cluster sets. A low overall value means that the cluster assignment does a good job of separating observation points on the basis of their MODIS EVI value ranges.

Based on this evaluation, looking at the average silhouette width for a measure of internal homogeneity and the three within-cluster variability measures for the proxy variables, we chose the clustering with 37 classes in it across the Southern California study area.

7) Mapping clusters onto southern California landscape. To map the clusters we needed a method to take the cluster assignments for the 10,000 random sample points and generalize these assignments across a raster surface covering the study area. We did this by constructing a random forest model to predict class assignments given environmental variables from the raster stack (constructing the model from data in the original sample point table as independent variables). We then used the raster library in R (Hijmans 2015) to run the random forest model across the GIS stack of 9 input rasters. Finally, we smoothed the GIS cluster map slightly by running a 3x3 majority filter over it.

8) Post-processing. We filled in areas of the map where soil data were absent by creating a supplemental random forest classifier with soils variables removed from the list of environmental variables, and ran this classifier to map clusters in areas with missing soils data.

The primary file in this dataset is named SCpam37n33.tif. It is a raster GIS file in geotiff format with 30 meter resolution. The projection for it is California Albers, NAD83 datum (EPSG:3310). Dimensions of the raster are 22748 rows x 15223 columns, and the bounding box x,y coordinates in the California Albers projection are -235440.000, -156780.000 (upper left) and 447000.000, -613470.000 (lower right). The raster is encoded as bytes with a minimum value of 1, a maximum value of 37, and a nodata value of 255.

The values in the raster represent individual cluster types, and the numbering from 1 to 37 is arbitrary. But by construction of the dataset each raster cluster value portrays similar landscapes. The additional file ELUdescriptions.csv (pipe-delimited) describes the landscapes associated with each cluster value, including common vegetation types found in each cluster. Note that the descriptions in this table were generated by a qualitative analysis of the geographic distribution of each cluster, and are not intended to be definitive keys to identification of each cluster type.

The study region of the dataset consists of the area of the HUC12 USGS watersheds encompassing the four National Forests in Southern California (Los Padres, Angeles, San Bernardino, and Cleveland NFs).

Intended users of this dataset include resource managers, researchers who are carrying out biogeographic studies, and people needing to make estimates of shrubland patterns and properties across this landscape.

This dataset is made available under a CC0 license.

Raster classification and mapping of ecological units of Southern California

Data files

Abstract

Methods

Usage notes