Understanding the Influence of Parameter Value Uncertainty on Climate Model Output: Developing an Interactive Web Dashboard
Abstract
Scientists at the National Center for Atmospheric Research have recently carried out several experiments to better understand the uncertainties associated with future climate projections. In particular, the NCAR Climate and Global Dynamics Lab (CGDL) working group has completed a large Parameter Perturbation Experiment (PPE) utilizing the Community Land Model (CLM), testing the effects of 32 parameters across thousands of simulations spanning 250 years. The CLM experiment is focused on understanding uncertainty around biogeophysical parameters that influence the balance of chemical cycling and sequestration variables. The current website for displaying model results is not intuitive or informative for the broader scientific audience or the general public. The goal of this project is to develop an improved data visualization dashboard for communicating the results of the CLM PPE. The interactive dashboard would provide an interface where new or experienced users can query the experiment database to ask which environmental processes are affected by a given model parameter, or vice versa. Improving the accessibility of the data will allow professionals to use the most recent land parameter data when evaluating the impact of a policy or action on climate change.
README: Understanding the Influence of Parameter Value Uncertainty on Climate Model Output: Developing an Interactive Dashboard
This README.md file was generated on 2024-05-22 by SOFIA INGERSOLL
https://doi.org/10.5061/dryad.vq83bk422
GENERAL INFORMATION
1. Title of the Project: Understanding the Influence of Parameter Value Uncertainty on Climate Model Output: Developing an Interactive Dashboard
3. Author Information: Sofia Ingersoll
A. Principal Investigator Contact Information
Name: Sofia Ingersoll
Institution: Bren School of Environmental Science & Management
Email: singersoll@bren.edu
B. Associate or Co-investigator Contact Information
Name: Heather Childers
Institution: Bren School of Environmental Science & Management
Email: hmchilders@bren.edu
C. Alternate Contact Information
Name: Sofia Ingersoll
Institution: Bren School of Environmental Science & Management
Email: Sofia.Ingersoll@Outlook.com
4. Date of data collection or obtaining (2024-05-23)
5. Geographic location of data collection: University of California, Santa Barbara – Bren School of Environmental Science & Management.
6. Information about funding sources that supported the collection of the data: N/A
SHARING/ACCESS INFORMATION
7. Licenses/restrictions placed on the data:
These data are open source and available for climate scientists to leverage for climate model calibration and identification of key climate influencers.
8. Links to publications that cite or use the data: N/A
9. Links to other publicly accessible locations of the data:
Visualize the data in this repository using an interactive dashboard. The dashboard enables public access to the National Center for Atmospheric Research – Climate and Global Dynamics Lab: Parameter Perturbation Experiment (NCAR – CGD: PPE) data. The PPE was generated by NCAR – CGD; the dashboard was developed at the University of California, Santa Barbara. [Link to public facing dashboard will be added here once client integrates into server]
10. Links/relationships to ancillary data sets:
- Visual example of the Community Land Model v5 (CLM5) climate model components and their respective parameter interactions. https://www.cesm.ucar.edu/models/clm
- Table of output variables for the Community Earth System Model v2 (CESM2) Large Ensemble Community Project. CLM5 is the land component of CESM2, and the CLM5 simulations in this project report output variables that follow the CESM2 conventions. The variables available in CESM2 are accessible using our dashboard linked above. https://www.cesm.ucar.edu/community-projects/lens2/output-variables
- Table of specifications for the Community Land Model v5 (CLM5) https://www.cesm.ucar.edu/models/clm/data
11. Was data derived from another source? If yes, list source(s):
- University of California, Santa Barbara – Climate and Global Dynamics Lab, National Center for Atmospheric Research: Parameter Perturbation Experiment (CGD NCAR PPE-5). https://webext.cgd.ucar.edu/I2000/PPEn11_OAAT/ (only public version of the data currently accessible; the data leveraged in this project are currently stored on the NCAR server and are not publicly available), https://www.cgd.ucar.edu/events/seminar/2023/katie-dagon-and-daniel-kennedy-132940 (learn more about this complex data via this amazing presentation by Katie Dagon and Daniel Kennedy)
- The Parameter Perturbation Experiment data leveraged by our project was generated utilizing the Community Land Model v5 (CLM5) predictions. https://www.earthsystemgrid.org/dataset/ucar.cgd.ccsm4.CLM_LAND_ONLY.html
12. Recommended citation for the project:
Ingersoll Sofia, Childers Heather, Bhattarai Sujan. “Understanding the Influence of Parameter Value Uncertainty on Climate Model Output: Developing an Interactive Dashboard”. (2024). University of California, Santa Barbara – Bren School of Environmental Science & Management. https://github.com/GaiaFuture/CLM5_PPE_Emulator
DATA & FILE OVERVIEW
13. File List:
Accepted Parameter Values
Content: The txt file contains 500 accepted values for the 32 parameter features provided by the CGD – NCAR. These values were used to train the emulator for predicting parameter sensitivity and uncertainty.
Preprocessed Data Folder
Content: Cleaned subsets of 500 simulations from the PPE Latin Hypercube (LHC) data sets that are suitable for use in a Gaussian Process Regression Machine Learning (GPR ML) Emulator. The subsets differ by user-selected time range and climate variable of interest. The time ranges include: 1995-2015, 2000-2015, 2005-2015, and 2010-2015. There are 11 climate variables in this folder, each containing a full set of time range subsets. The list includes: GPP, NBP, TOTVEGC, TLAI, EFLX_LH_TOT, SOILWATER_10CM, QRUNOFF, FSR, FAREA_BURNED, SNOWDP, LNC.
Most were selected due to popularity; 'LNC' was selected to assist with the quality assurance measures covered in detail further below.
File naming convention: f"{var}_{time_selection}.nc"
File attributes: The dimensions of the preprocessed LHC perturbed parameter simulations are (500,1)
Units Dictionary
Content: Units for the 11 user selected climate variables
File naming convention: units_dict.pkl
Attributes: Attributes and structures defined in this notebook.
Emulation Results Data
Content: Files containing our 'pickled' emulator. In other words, each file contains a serialized byte stream of the data generated by the trained GPR ML Emulator, which can be loaded from disk and reused for further predictions and visualizations.
File naming convention: f"{var_name}_{param_name}_{time_selection}_gpr_model.sav"
File attributes: results_dict
```python
results_dict = {
    # perturbed parameter test values
    'X_values': {},
    # climate variable predictions
    'y_pred': {},
    # climate variable prediction standard deviations
    'y_std': {},
    # per-parameter R^2 values
    'r2': {},
    # the trained GPR ML model
    'gpr_model': gpr_model,
    # y_test to use to calculate R^2 later
    'y_test': y_test,
    'X_test': X_test
}
```
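As a rough sketch of how one of these pickled files might be read back, assuming a local copy of the emulation_results folder (the filename below is illustrative, and the tuple layout follows the pickle.dump call in train_emulator() shown under Methods):

```python
import pickle

# Illustrative filename; actual files follow f"{var_name}_{param_name}_{time_selection}_gpr_model.sav"
with open("emulation_results/GPP_LEAFCN_2005-2015_gpr_model.sav", "rb") as f:
    X_values, y_pred, y_std, r2, param_name, var_name = pickle.load(f)

print(var_name, param_name, round(r2, 2))
print(y_pred.shape, y_std.shape)  # 100 predictions and their standard deviations
```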
Ylims Dictionary
Content: ylim boundaries of variable predictions to use for plotting
File naming convention: ylim_dict.pkl
Attributes: Attributes and structures defined in this notebook.
Plots
Content: FAST parameter sensitivity bar charts with an inset emulator accuracy plot (see Methods below).
File naming convention: f"fast_acc_plot_{var_name}_{param_name}_{time_selection}.png"
Content: Emulator prediction plots showing a climate variable's response to a perturbed parameter with a 3 standard deviation confidence band.
File naming convention: f"emulator_plot_{var_name}_{param_name_upper}_{time_selection}_gpr_model.png"
14. Relationship between files, if important:
The preprocessed_data files are the inputs used to generate the emulated_data. Each preprocessed file was used to train the GPR ML emulator to produce a respective output for each climate variable and parameter for the time period of interest. The predicted values were then used to create the visualizations in the PNGs.
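As a minimal sketch of working with these files locally (paths and the "GPP"/"2005-2015" selections are illustrative; xarray and pickle are assumed):

```python
import pickle
import xarray as xr

# one preprocessed LHC subset: global annual means, one value per simulation
ds = xr.open_dataset("Results/preprocessed_data/GPP_2005-2015.nc")
gpp = ds["GPP"]  # dimensions (500, 1) per the file attributes above

# units dictionary for labeling plots
with open("Results/units_dict.pkl", "rb") as f:
    units_dict = pickle.load(f)
print(units_dict.get("GPP", {}).get("units", "Unknown units"))
```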
15. Additional related data collected that was not included in the current data package: N/A
16. Are there multiple versions of the dataset? N/A
METHODOLOGICAL INFORMATION
17. Description of methods used for collection/generation of data:
Attributes and structures defined in this notebook outline the workflow used to generate the data in this repo. The notebook pulls functions from utils.py to execute the desired commands. Below we will look at the utils.py functions that are not explicitly defined in the notebook. General side note: if you decide to explore the notebook explaining how the data were made, you'll notice you'll be transported to another repo in this organization, GaiaFuture. That's our prototype playground! It's a little messy because that's where we spent the second half of this project tinkering. The official repository is https://github.com/GaiaFuture/CLM5_PPE_Emulator.
18. Methods for processing the data:
Preprocessed Data
We worked inside NCAR's CASPER HPC cluster, which gave us direct access to the raw data files. We created a script to read in 500 LHC PPE simulations as a single data set, with inputs for a climate variable and a time range. When reading in the cluster of simulations, a preprocess function performs dimensional reduction to simplify the data set for later wrangling.
```python
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- cluster reading function ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# modify the function if you want to pass the parameter
# pull 20 years of data
# this has cool potential to have a time period subset option
# indexing the selected lists or using multiples of 500
def read_all_simulations(var, time_selection):
    '''Prepare the cluster lists and read them to create an ensemble (group of data).
    Use preprocess to select only certain dimensions and a variable.'''
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #---- Define list of cluster lists ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # each cluster list contains 500 simulations to call
    cluster_lists = [
        # 1995 - 2000
        sorted(glob.glob('/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.1995-02-01-00000.nc'))[1:],
        # 2000 - 2005
        sorted(glob.glob('/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.2000-02-01-00000.nc'))[1:],
        # 2005 - 2010
        sorted(glob.glob('/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.2005-02-01-00000.nc'))[1:],
        # 2010 - 2015
        sorted(glob.glob('/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.2010-02-01-00000.nc'))[1:]
    ]
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #---- Prepping to Load Cluster ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    def preprocess(ds, var):
        '''used in xr.open_mfdataset as the preprocess step so that only
        these four things are selected before the data is combined'''
        return ds[['lat', 'lon', 'time', var]]
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #---- If-else Load Selected Time ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Select appropriate lists based on the time_selection
    if time_selection == '2010-2015':
        selected_files = cluster_lists[3]
        # Read the list and load it for the notebook using combine='nested'
        ds = xr.open_mfdataset(selected_files,
                               combine='nested',
                               preprocess=lambda ds: preprocess(ds, var),
                               parallel=True,
                               concat_dim=["ens"])
    else:
        # all four lists (python slicing is end exclusive, so go up to 4)
        if time_selection == '1995-2015':
            selected_lists = cluster_lists[:4]
        # lists 2-4 (index 1 onward)
        elif time_selection == '2000-2015':
            selected_lists = cluster_lists[1:4]
        # lists 3-4 (index 2 onward)
        elif time_selection == '2005-2015':
            selected_lists = cluster_lists[2:4]
        # safety check to ensure a user selects an available time range
        else:
            raise ValueError("Uh oh, please select a time range that is currently available.")
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        #---- Load in Cluster Data ----
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # read the lists and load them for the notebook
        ds = xr.open_mfdataset(selected_lists,
                               combine='nested',
                               # lambda allows us to call the predefined preprocess on the ds
                               preprocess=lambda ds: preprocess(ds, var),
                               parallel=True,
                               concat_dim=["time", "ens"])
    # we aren't going to save these files because they need to be preprocessed
    # using the wrangle and subset functions
    # better to keep these things broken up / shorter for future updates
    # makes sense to keep the if-else pulling statement at the top of read_n_wrangle
    return ds
```
Once the data sets of interest were loaded, they were ready for some dimensional corrections (some quirks that come with using CESM data). Our friends at NCAR CGDL provided us with the fix for the CESM monthly time-parsing bug. The other functions, which weight each grid cell by land area, weight each month according to its contribution to the number of days in a year, and calculate the global average of each simulation, were written by our team to wrangle the data into a form suitable for emulation. These files were saved so they could be reused later via a built-in if-else statement within the read_n_wrangle() function.
```python
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- correct time-parsing bug ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def fix_time(da):
    '''fix CESM monthly time-parsing bug'''
    yr0 = str(da['time.year'][0].values)
    da['time'] = xr.cftime_range(yr0, periods=len(da.time), freq='MS', calendar='noleap')
    return da

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Weight Gridcells by Landarea ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def weight_landarea_gridcells(da, landarea):
    # weight the variable by landarea, then average over the gridcell dimension
    return da.weighted(landarea).mean(dim='gridcell')

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Weighted Averages by Time ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def yearly_weighted_average(da):
    # Get the array of number of days from the main dataset
    days_in_month = da['time.daysinmonth']
    # Multiply each month's data by the corresponding days in the month
    weighted_month = (days_in_month * da).groupby("time.year").sum(dim='time')
    # Total days in the year
    days_per_year = days_in_month.groupby("time.year").sum(dim='time')
    # Calculate the weighted average for the year
    return weighted_month / days_per_year

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Subset Var Wrangle Funct ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def wrangle_var_cluster(da):
    '''Weight the gridcell dimension by landarea and weight the time dimension
    by the days in each month over the total number of days in a year.
    Globally average the selected variable over the chosen time range
    and return it as an xr.DataArray.'''
    # weight gridcell dim by global land area (landarea is defined elsewhere in the workflow)
    da_global = weight_landarea_gridcells(da, landarea)
    # weight time dim by days in month
    da_global_ann = yearly_weighted_average(da_global)
    # take the global average for the variable over the year dimension
    var = da_global_ann.mean(dim='year')
    return var
```
All of the functions above are packaged into a single function, read_n_wrangle(), which loads the data, reduces its dimensionality, properly weights the time and grid cell dimensions, stores the user-selected parameter and variable names for later plotting, and saves the data set in a folder so it can be reused later via a built-in if-else statement. Additionally, it collects some metrics for quality assurance purposes.
```python
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Wrangle Data ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def read_n_wrangle(param, var, time_selection):
    '''Read in the LHC subset for a user-selected variable and time selection.
    Also load the accepted parameter values provided by NCAR.'''
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- User Selections ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # store user inputs as global variables; called later for plotting
    global param_name, var_name
    param_name = param
    var_name = var
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- Parameter Data ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # pull in parameter data
    params = param_wrangling()
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- If-else Load Data ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    filepath = os.path.join("Results/preprocessed_data", f"{var}_{time_selection}.nc")
    if os.path.exists(filepath):
        # read in the file as a dataset
        ds = xr.open_dataset(filepath)
        # then convert back to a data array
        var_avg = ds[var]
    else:
        print(f"Reading and wrangling your data, this may take a few minutes")
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # ---- Subset User Selection Funct ----
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        var_da = subset_var_cluster(var, time_selection)
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # ---- Subset Var Wrangle Funct ----
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        var_avg = wrangle_var_cluster(var_da)
        # convert the data array to a dataset before writing to file
        ds = var_avg.to_dataset(name=var)
        # note: this will throw an error if you try to overwrite existing files
        ds.to_netcdf(filepath)
    return params, var_avg, param_name, var_name
```
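For orientation, here is a hedged usage sketch of read_n_wrangle(); it assumes utils.py has been imported and the Casper file paths above are reachable, and the parameter, variable, and time range are illustrative choices:

```python
# 'leafcn' (parameter), 'GPP' (variable), and '2005-2015' (time range) are example inputs
params, var_avg, param_name, var_name = read_n_wrangle("leafcn", "GPP", "2005-2015")

# params  -> (500, 32) accepted parameter values used to train the emulator
# var_avg -> global annual-mean GPP, one value per LHC simulation
```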
Emulator Data
The preprocessed data is then used in the GPR ML Emulator to make 100 predictions for a climate variable of interest against each of the 32 individual parameters. To summarize briefly without getting too into the nitty gritty, our GPR emulator does 3 things:
- Simplifies the LHC data so it can look at 1 parameter at a time and assess its relationship with a climate variable.
- Applies a Fourier Amplitude Sensitivity Test (FAST) to identify relationships between parameters and climate variables. It helps us see what the key influencers are.
- In the full, chaotic LHC, it can assess the covariance of the parameter-parameter predictions simultaneously (this is the R^2 value you'll see on your accuracy inset plot later).
Additionally, it 'pickles' and saves the predictions and the trained gpr_model so they can be utilized for further analysis, exploration, and visualizations.
```python
def train_emulator(param, var, var_name, time_selection):
    '''Train the Gaussian Process Machine Learning Emulator, pickle it,
    along with the predictions + std for a climate variable vs
    each of the 32 parameters. These will be used later for plotting &
    determining ylims'''
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- Split Data ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    X_train, X_test, y_train, y_test = train_test_split(param,
                                                        var,
                                                        test_size=0.2,
                                                        random_state=0)
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- Kernel Specs No Tuning ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    kernel = ConstantKernel(constant_value=3, constant_value_bounds=(1e-2, 1e4)) \
        * RBF(length_scale=1, length_scale_bounds=(1e-4, 1e8))
    # Using an out-of-the-box kernel for now
    gpr_model = GaussianProcessRegressor(kernel=kernel,
                                         n_restarts_optimizer=20,
                                         random_state=99,
                                         normalize_y=True)
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- Fit the Model ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # Fit the model to the training data
    gpr_model.fit(X_train, y_train)
    # Prepare to store results
    results_dict = {
        'X_values': {},
        'y_pred': {},
        'y_std': {},
        'r2': {},
        # save the trained GPR model
        'gpr_model': gpr_model,
        # save y_test for R^2 later
        'y_test': y_test,
        'X_test': X_test
    }
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # ---- Iterate thru Params ----
    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    for param_name, param_index in param_names_dict.items():
        # Create X_values for the prediction linspace
        # (r2 drops to 0.004 when removing this, but we're only using the R^2 from the FAST plot)
        X_values = np.full((100, len(param_names_dict)), 0.5)
        # X_values = np.tile(X_test, 1)
        X_values[:, param_index] = np.linspace(0, 1, 100)
        # Vary only the current parameter over a linspace
        # X_values[:, param_index] = np.linspace(np.min(X_test[:, param_index]), np.max(X_test[:, param_index]))
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # ---- Get Predictions ----
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # Make predictions for the current parameter
        y_pred, y_std = gpr_model.predict(X_values, return_std=True)
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # ---- Collect Metrics ----
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        mae = mean_absolute_error(y_test, y_pred)
        rmse_emulator = np.sqrt(mean_squared_error(y_test, y_pred))
        r2_emulator = np.corrcoef(y_test, y_pred)[0, 1]**2
        # Store results in dictionaries
        results_dict['X_values'][param_name] = X_values
        results_dict['y_pred'][param_name] = y_pred
        results_dict['y_std'][param_name] = y_std
        results_dict['r2'][param_name] = r2_emulator
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # ---- Pickle Emulation ----
        #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        # Save the predictions and overall R^2 to a file
        filename = os.path.join("emulation_results", f"{var_name}_{param_name}_{time_selection}_gpr_model.sav")
        if os.path.exists(filename):
            # Load the model from disk
            loaded_model = pickle.load(open(filename, 'rb'))
        else:
            print(f"Emulator is running for {param_name}, this may take a few moments")
            with open(filename, 'wb') as file:
                pickle.dump((X_values, y_pred, y_std, r2_emulator, param_name, var_name), file)
    return results_dict
```
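Continuing the usage sketch above, training the emulator on the wrangled outputs might look like this (names are illustrative; the per-parameter keys in results_dict follow param_names_dict and appear to be stored in upper case, matching how plot_emulator() accesses them):

```python
results_dict = train_emulator(params, var_avg, var_name, "2005-2015")

# 100 predictions and standard deviations for one parameter sweep;
# "LEAFCN" assumes upper-case keys, matching plot_emulator()
y_pred = results_dict["y_pred"]["LEAFCN"]
y_std = results_dict["y_std"]["LEAFCN"]
```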
The 100 predictions for each relationship are stored in a pickled dictionary, which is then used by the plotting functions below to generate the PNGs.
Plots
These display the predicted relationship between a climate variable and a perturbed parameter alongside 3 standard deviations of uncertainty, giving a 99.7% confidence interval. They are intended to aid in climate model calibration.
```python
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Plot Emulator ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def plot_emulator(results_dict, var_name, param_name, param_names_dict, time_selection, global_min, global_max):
    # Load pickled units dictionary
    filename = os.path.join("Results/units_dict.sav")
    units_dict = pickle.load(open(filename, 'rb'))
    # Retrieve the units for the specific variable
    units = units_dict.get(var_name, {}).get('units', 'Unknown units')
    # Convert param_name to uppercase to match the filenames
    param_name_upper = param_name.upper()
    # Retrieve the data for the specific parameter
    X_values = results_dict['X_values'][param_name_upper]
    y_pred = results_dict['y_pred'][param_name_upper]
    y_std = results_dict['y_std'][param_name_upper]
    # Get the parameter index corresponding to the name
    indexed_param = param_names_dict[param_name_upper]
    # Calculate the z-score for the 99.7% confidence interval
    z_score = norm.ppf(0.99865)
    # Plot the results
    plt.figure(figsize=(10, 6))
    plt.rcParams['font.family'] = 'Roboto'  # Load the Roboto font
    plt.style.use('dark_background')        # Set the style to a dark theme
    plt.plot(X_values[:, indexed_param],
             y_pred,
             color='white',
             linewidth=3,
             label='GPR Prediction')
    # Apply z-score for 99.7% CI
    plt.fill_between(X_values[:, indexed_param],
                     y_pred - z_score * y_std, y_pred + z_score * y_std,
                     alpha=0.5,
                     color='#62c900ff',
                     label='3 St.Dev., Confidence Interval')
    plt.xlabel(f'Perturbed Parameter: {param_name.title()}', size=18)
    plt.ylabel(f'Climate Variable: {var_name.split("_")[0].title()} {units}', size=18)
    plt.title(f'Parameter Sensitivity and Uncertainty Estimation \nAssessing Global Annual Means {time_selection}', size=24)
    plt.legend(fontsize=16)
    plt.tick_params(axis='both', which='major', labelsize=14)
    # Set y limits based on global min and max
    plt.ylim(global_min, global_max)
    # Save the plot as a PNG file
    plot_dir = 'Results/plots/emulator'
    os.makedirs(plot_dir, exist_ok=True)
    plt.savefig(os.path.join(plot_dir, f'emulator_plot_{var_name}_{param_name_upper}_{time_selection}_gpr_model.png'))
    plt.tight_layout()
    plt.show()
```
These display the parameter sensitivity analysis to aid climate scientists in the determination of key environmental influencers. It leverages the information provided by the emulation_data.
```python
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Plot FAST Accuracy ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def plot_FAST_accuracy(results_dict, var_name, param_name, time_selection):
    # Retrieve the units for the specific variable
    units = units_dict.get(var_name, {}).get('units', 'Unknown units')
    # Convert param_name to uppercase to match the filenames
    param_name_upper = param_name.upper()
    # Retrieve the gpr_model from results_dict
    gpr_model = results_dict['gpr_model']
    # Retrieve the data for the specific parameter
    y_pred = results_dict['y_pred'][param_name_upper]
    y_std = results_dict['y_std'][param_name_upper]
    y_test = results_dict['y_test']
    # Retrieve unweighted X_test to make new predictions
    X_test = results_dict['X_test']
    # Make new predictions with unweighted parameter data
    y_pred_full = gpr_model.predict(X_test)
    # Calculate new r² score
    r2_emulator = np.corrcoef(y_test, y_pred_full)[0, 1]**2

    def gaussian_regression_lines(gpr_model):
        # one-at-a-time sweeps: vary each parameter over a linspace, hold the rest at 0.5,
        # and keep the first-harmonic Fourier amplitude of the predicted response
        fourier_amplitudes = []
        for param_index in range(32):
            X_values = np.full((10, 32), 0.5)
            X_values[:, param_index] = np.linspace(0, 1, 10)
            y_pred, y_std = gpr_model.predict(X_values, return_std=True)
            y_fft = fft(y_pred)
            amplitude = np.abs(y_fft)
            fourier_amplitudes.append(amplitude[1])
        return fourier_amplitudes

    fourier_amplitudes = gaussian_regression_lines(gpr_model)
    sorted_indices = np.argsort(fourier_amplitudes)
    sorted_fourier_amplitudes = np.array(fourier_amplitudes)[sorted_indices]
    swapped_param_keys = {v: k for k, v in create_parameter_names_dict().items()}
    sorted_parameter_names = [swapped_param_keys[index] for index in sorted_indices]

    fig, ax = plt.subplots(figsize=(16, 8))
    # Load the Roboto font
    plt.rcParams['font.family'] = 'Roboto'
    ax.barh(range(len(sorted_fourier_amplitudes)), sorted_fourier_amplitudes, color='#62c900ff', alpha=0.5)
    ax.set_ylabel('')
    ax.set_xlabel('Fourier Amplitude Sensitivity Test (FAST)', size=18)
    ax.set_title(f'Parameter Sensitivity Analysis for {var_name} {units}', size=24, weight='bold')
    ax.set_yticks(range(len(sorted_fourier_amplitudes)), sorted_parameter_names)
    ax.grid(axis='x', linestyle='--', alpha=0.7)
    ax.set_aspect('auto', adjustable='box')
    # Create inset for accuracy plot
    ax_inset = inset_axes(ax, width="40%", height="40%", loc='center right')
    ax_inset.errorbar(y_test, y_pred_full, yerr=3*y_std, fmt="o", color='#62c900ff', alpha=0.5)
    ax_inset.plot([0, np.max(y_test)], [0, np.max(y_pred_full)], linestyle='--', color='white')
    ax_inset.set_xlim([np.min(y_test)-1, np.max(y_test)+1])
    ax_inset.set_ylim([np.min(y_pred_full)-1, np.max(y_pred_full)+1])
    ax_inset.set_xlabel(f'{var_name} Test', size=16)
    ax_inset.set_ylabel(f'Emulated Variable: {var_name} {units}', size=16)
    ax_inset.set_title(f'Emulator Accuracy: {var_name} {units} \nAssessing Global Annual Means {time_selection}', size=18, weight='bold')
    ax_inset.text(0.5, 0.1, f'R² Score = {np.round(r2_emulator, 2)}', fontsize=12,
                  transform=ax_inset.transAxes, horizontalalignment='center', weight='bold')
    # Save the plot as a PNG file
    plot_dir = 'Results/plots/fast_accuracy'
    os.makedirs(plot_dir, exist_ok=True)
    plt.savefig(os.path.join(plot_dir, f'fast_acc_plot_{var_name}_{param_name}.png'))
    # plt.tight_layout()
    return plt.show()
```
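Putting the two plotting functions together, here is a hedged sketch of generating both PNGs for one variable/parameter pair (the ylim_dict structure of per-variable (min, max) pairs is an assumption, and names and paths are illustrative):

```python
import pickle

# y-limit bounds for consistent plotting across parameters (assumed (min, max) per variable)
with open("Results/ylim_dict.pkl", "rb") as f:
    ylim_dict = pickle.load(f)
global_min, global_max = ylim_dict["GPP"]

param_names_dict = create_parameter_names_dict()  # assumed to come from utils.py
plot_emulator(results_dict, "GPP", "leafcn", param_names_dict,
              "2005-2015", global_min, global_max)
plot_FAST_accuracy(results_dict, "GPP", "leafcn", "2005-2015")
```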
19. Instrument- or software-specific information needed to interpret the data:
- cartopy=0.22.0
- dask=2024.1.0
- dask-jobqueue=0.8.5
- gpflow=2.5.2
- hvplot=0.9.2
- ipython=8.22.2
- jupyter=1.0.0
- jupyterlab=4.1.5
- netcdf4=1.6.5
- notebook=7.1.2
- numpy=1.26.4
- pandas=2.2.1
- panel=1.3.8
- plotly=5.19.0
- python=3.11.7
- regionmask=0.12.1
- scikit-learn=1.4.1
- scipy=1.12.0
- seaborn=0.13.1
- shapely=2.0.3
- sparse=0.15.1
- statsmodels=0.14.1
- xarray=2024.2.0
- xesmf=0.8.4
20. Standards and calibration information, if appropriate:
PNGs of all the visualizations were made as a form of quality assurance to assess how the emulator was performing. The fast_acc_plot contains an R^2 value that was used to determine the overall accuracy of the prediction. As a standard, the R^2 value provided by the emulator when assessing the total LHC parameter space was above 0.65. The kernel specification below enables the GPR ML emulator to assess the covariance of the 32 parameters simultaneously to produce predictions for a given climate variable.
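A minimal sketch of that standard as a programmatic check, reusing the same overall R^2 calculation as plot_FAST_accuracy() (results_dict as returned by train_emulator()):

```python
import numpy as np

y_test = results_dict["y_test"]
y_pred_full = results_dict["gpr_model"].predict(results_dict["X_test"])
r2_overall = np.corrcoef(y_test, y_pred_full)[0, 1] ** 2

# QA standard: overall R^2 across the full LHC parameter space should exceed 0.65
assert r2_overall > 0.65, f"Emulator below QA standard: R^2 = {r2_overall:.2f}"
```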
21. Environmental/experimental conditions:
Once the Technical Documentation for this project is available, there will be a significant section in the Solution Design dedicated to different kernel components that may be combined to better optimize the emulator.
Kernel specification for GPR:
```python
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ---- Kernel Specs No Tuning ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel = ConstantKernel(constant_value=3, constant_value_bounds=(1e-2, 1e4)) \
    * RBF(length_scale=1, length_scale_bounds=(1e-4, 1e8))
```
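As an illustration of how kernel components could be combined, here is a hedged sketch using scikit-learn's standard kernels; this is not the configuration used in this project, only an example of composing components:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Example composite kernel: a Matern covariance scaled by a constant, plus a
# white-noise term to absorb simulation noise; bounds mirror the spec above.
alt_kernel = (
    ConstantKernel(constant_value=3, constant_value_bounds=(1e-2, 1e4))
    * Matern(length_scale=1.0, length_scale_bounds=(1e-4, 1e8), nu=2.5)
    + WhiteKernel(noise_level=1e-3, noise_level_bounds=(1e-8, 1e1))
)
alt_gpr = GaussianProcessRegressor(kernel=alt_kernel,
                                   n_restarts_optimizer=20,
                                   normalize_y=True,
                                   random_state=99)
```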
22. Describe any quality-assurance procedures performed on the data:
The creation of the PNG files was a measure to ensure the GPR ML emulator predictions are performing as expected. Our team used the relationship between LNC and leafcn to assess the performance of our emulator because it is highly linear; this made it easiest to identify the correlation and affirm our model was working as expected.
23. People involved with sample collection, processing, analysis and/or submission:
Principal Investigator
Name: Sofia Ingersoll
Co-Investigators
Name: Sujan Bhattarai
Name: Heather Childers
24. DATA-SPECIFIC INFORMATION FOR:
`Accepted Parameter Values`
(500,32) accepted parameter values. 1 file
`Preprocessed_data`
(500,1) perturbed parameter LHC simulations. 44 files
`units_dict`
Pickled dictionary of 11 climate variables of interest. 1 file. Attributes and structures defined in this notebook
`Emulation_data`
results_dict is a dictionary that contains the 100 predictions + uncertainty, necessary values to make new predictions, and the trained emulator for all 11 climate variables against each of the 32 parameters. 320 total pickled files.
`ylim_dict`
Pickled dictionary of ylim boundaries for the 11 climate variables of interest. 1 file. Attributes and structures defined in this notebook
`Plots`
`Emulation Uncertainty plot`
PNG object
`FAST accuracy plot`
PNG object
25. Variable List:
See the file units_dict for a list of variables and their respective units. The parameters are normalized between 0 and 1, so units are not applicable in this instance. A description of the variables may be accessed using this link:
Table of specifications for the Community Land Model v5 (CLM5) https://www.cesm.ucar.edu/models/clm/data
26. Missing data codes:
There are none present in this data; however, there is an opportunity for missing values in one section of the workflow outlined in the notebook for units_dict. If no units are present in the data set being read, the output would look something like this:
`units_dict`: [ var_name: units: 'Unknown' ]
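This mirrors the fallback lookup already used in the plotting functions, where a variable with no recorded units degrades to a placeholder instead of raising an error (the dictionary entry below is illustrative):

```python
# same lookup pattern as plot_emulator() / plot_FAST_accuracy()
units_dict = {"GPP": {"units": "example units"}}                    # illustrative entry
print(units_dict.get("GPP", {}).get("units", "Unknown units"))      # -> example units
print(units_dict.get("SNOWDP", {}).get("units", "Unknown units"))   # -> Unknown units
```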
27. Specialized formats or other abbreviations used:
This table is a lifeline when deciphering the CESM data abbreviations. It is important to note that the dashboard will include a table of the 32 parameters and the top 10 most common climate variables, with a brief description of each and their Earth systems category.
Table of specifications for the Community Land Model v5 (CLM5) https://www.cesm.ucar.edu/models/clm/data
Methods
Data Source:
- University of California, Santa Barbara – Climate and Global Dynamics Lab, National Center for Atmospheric Research: Parameter Perturbation Experiment (CGD NCAR PPE-5). https://webext.cgd.ucar.edu/I2000/PPEn11_OAAT/ (only public version of the data currently accessible; the data leveraged in this project are currently stored on the NCAR server and are not publicly available), https://www.cgd.ucar.edu/events/seminar/2023/katie-dagon-and-daniel-kennedy-132940 (learn more about this complex data via this amazing presentation by Katie Dagon and Daniel Kennedy)
- The Parameter Perturbation Experiment data leveraged by our project was generated utilizing the Community Land Model v5 (CLM5) predictions. https://www.earthsystemgrid.org/dataset/ucar.cgd.ccsm4.CLM_LAND_ONLY.html
Data Processing:
We worked inside NCAR's CASPER HPC cluster, which gave us direct access to the raw data files. We created a script to read in 500 LHC PPE simulations as a single data set, with inputs for a climate variable and a time range. When reading in the cluster of simulations, a preprocess function performs dimensional reduction to simplify the data set for later wrangling.
Once the data sets of interest were loaded, they were ready for some dimensional corrections (some quirks that come with using CESM data). Our friends at NCAR CGDL provided us with the fix for the CESM monthly time-parsing bug. The other functions, which weight each grid cell by land area, weight each month according to its contribution to the number of days in a year, and calculate the global average of each simulation, were written by our team to wrangle the data into a form suitable for emulation. These files were saved so they could be reused later via a built-in if-else statement within the `read_n_wrangle()` function.
The preprocessed data is then used in the GPR ML Emulator to make 100 predictions for a climate variable of interest against each of the 32 individual parameters. To summarize briefly without getting too into the nitty gritty, our GPR emulator does 3 things:
- Simplifies the LHC data so it can look at 1 parameter at a time and assess its relationship with a climate variable.
- Applies a Fourier Amplitude Sensitivity Test (FAST) to identify relationships between parameters and climate variables. It helps us see what the key influencers are.
- In the full, chaotic LHC, it can assess the covariance of the parameter-parameter predictions simultaneously (this is the R^2 value you'll see on your accuracy inset plot later).
Additionally, it ‘pickles’ and saves the predictions and trained gpr_model so they can be utilized for further analysis, exploration, and visualizations.
Attributes and structures defined in this notebook outline the workflow used to generate the data in this repo. The notebook pulls functions from utils.py to execute the desired commands; the README above walks through the utils.py functions that are not explicitly defined in the notebook. General side note: if you decide to explore the notebook explaining how the data were made, you'll notice you'll be transported to another repo in this organization, GaiaFuture. That's our prototype playground! It's a little messy because that's where we spent the second half of this project tinkering. The official repository is https://github.com/GaiaFuture/CLM5_PPE_Emulator.