A toy statistical model of ice thickness#

This notebook primarily introduces the gis.gridded_attributes and gis.gridded_mb_attributes tasks, which add diagnostic variables to the glacier maps: slope, aspect, distance from the glacier border, and a few mass-balance related variables that might be useful to some people.

We also show how to use these data to build a very simple (and not very good) regression model of ice thickness. This second part (“Build a statistical model”) is rather technical and not directly related to OGGM, so you might want to stop at that point unless you are interested in the topic yourself. You will need to install scikit-learn for the second part of this notebook.

Set-up#

We are going to use the South Glacier example taken from the ITMIX experiment. We’ll also use the data provided with it.

# Libs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import salem

# OGGM
import oggm.cfg as cfg
from oggm import utils, workflow, tasks, graphics
# Initialize OGGM and set up the default run parameters
cfg.initialize(logging_level='WARNING')
cfg.PARAMS['use_multiprocessing'] = False
# Local working directory (where OGGM will write its output)
cfg.PATHS['working_dir'] = utils.gettempdir('OGGM_Toy_Thickness_Model')
# We start at level 3, because we need all data for the attributes
gdirs = workflow.init_glacier_directories(['RGI60-01.16195'], from_prepro_level=3, prepro_border=10)

These are the tasks adding the attributes to the gridded file:

# Tested tasks
task_list = [
    tasks.gridded_attributes,
    tasks.gridded_mb_attributes,
]
for task in task_list:
    workflow.execute_entity_task(task, gdirs)
# Pick our glacier
gdir = gdirs[0]

The gridded attributes#

Let’s open the gridded data file with xarray:

ds = xr.open_dataset(gdir.get_filepath('gridded_data'))
# List all variables
ds

The file contains several variables with their description. For example:

ds.oggm_mb_above_z

Let’s plot a few of them (we show how to plot them both with xarray and with OGGM):

ds.slope.plot();
plt.axis('equal');
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
graphics.plot_raster(gdir, var_name='aspect', cmap='twilight', ax=ax1)
graphics.plot_raster(gdir, var_name='oggm_mb_above_z', ax=ax2)

Retrieve these attributes at point locations#

For this glacier, we have ice thickness observations (downloaded from the supplementary material of the ITMIX paper). Let’s make a table out of them:

df = salem.read_shapefile(utils.get_demo_file('IceThick_SouthGlacier.shp'))
coords = np.array([p.xy for p in df.geometry]).squeeze()
df['lon'] = coords[:, 0]
df['lat'] = coords[:, 1]
df = df[['lon', 'lat', 'thick']]

Convert the longitudes and latitudes to the glacier map projection:

xx, yy = salem.transform_proj(salem.wgs84, gdir.grid.proj, df['lon'].values, df['lat'].values)
df['x'] = xx
df['y'] = yy

And plot these data:

geom = gdir.read_shapefile('outlines')
f, ax = plt.subplots()
df.plot.scatter(x='x', y='y', c='thick', cmap='viridis', s=10, ax=ax);
geom.plot(ax=ax, facecolor='none', edgecolor='k');

Method 1: interpolation#

The measurement points of this dataset are very close to each other, and there are plenty of them:

len(df)

Here, we will keep them all and interpolate the variables of interest at the point locations. We use xarray for this:

vns = ['topo',
       'slope',
       'slope_factor',
       'aspect',
       'dis_from_border',
       'catchment_area',
       'lin_mb_above_z',
       'lin_mb_above_z_on_catch',
       'oggm_mb_above_z',
       'oggm_mb_above_z_on_catch',
       ]
# Interpolate (bilinear)
for vn in vns:
    df[vn] = ds[vn].interp(x=('z', df.x), y=('z', df.y))

Let’s see how these variables can explain the measured ice thickness:

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))
df.plot.scatter(x='dis_from_border', y='thick', ax=ax1);
df.plot.scatter(x='slope', y='thick', ax=ax2);
df.plot.scatter(x='oggm_mb_above_z', y='thick', ax=ax3);

There is a negative correlation with slope (as expected), a positive correlation with the mass-flux (oggm_mb_above_z), and a slight positive correlation with the distance from the glacier boundaries.
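
To put numbers on these visual impressions, we can compute the pairwise Pearson correlations directly with pandas (a minimal check only; the exact values depend on the interpolation above):

# Pearson correlation of measured thickness with each candidate predictor
for vn in ['slope', 'oggm_mb_above_z', 'dis_from_border']:
    print('{}: r = {:.2f}'.format(vn, df['thick'].corr(df[vn])))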

Method 2: aggregated per grid point#

There are so many points that much of the information they carry is redundant: neighbouring measurements often fall within the same grid cell, where the OGGM variables are interpolated anyway. A way to deal with this is to aggregate all measurement points falling on the same grid point and to average them. Let’s do this:

df_agg = df[['lon', 'lat', 'thick']].copy()
ii, jj = gdir.grid.transform(df['lon'], df['lat'], crs=salem.wgs84, nearest=True)
df_agg['i'] = ii
df_agg['j'] = jj
# Trick: build a common index out of the i and j coordinates
df_agg['ij'] = ['{:04d}_{:04d}'.format(i, j) for i, j in zip(ii, jj)]
df_agg = df_agg.groupby('ij').mean()
# Conversion does not preserve ints
df_agg['i'] = df_agg['i'].astype(int)
df_agg['j'] = df_agg['j'].astype(int)
len(df_agg)
# Select
for vn in vns:
    df_agg[vn] = ds[vn].isel(x=('z', df_agg['i']), y=('z', df_agg['j']))

We now have roughly nine times fewer points, but the main features of the data remain unchanged:

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))
df_agg.plot.scatter(x='dis_from_border', y='thick', ax=ax1);
df_agg.plot.scatter(x='slope', y='thick', ax=ax2);
df_agg.plot.scatter(x='oggm_mb_above_z', y='thick', ax=ax3);

Build a statistical model#

Let’s use scikit-learn to build a very simple linear model of ice thickness. First, we have to acknowledge that many of the explanatory variables we will use are correlated with each other:

import seaborn as sns
plt.figure(figsize=(10, 8));
sns.heatmap(df.corr(), cmap='RdBu');

This is a problem for linear regression models, which cannot deal well with correlated explanatory variables. We have to do a so-called “feature selection”, i.e. keep only the variables which are independently important for explaining the outcome.
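
One quick way to see the problem in numbers (a sketch, not part of the original workflow; the 0.9 threshold is arbitrary) is to list the pairs of predictors whose absolute correlation is very high:

# List predictor pairs with an absolute correlation above 0.9
corr = df.drop(columns=['thick']).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().loc[lambda s: s > 0.9].sort_values(ascending=False))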

For the sake of simplicity, let’s use the Lasso method to help us out:

import warnings
warnings.filterwarnings("ignore")  # sklearn sends a lot of warnings
# scikit-learn (can be installed e.g. via conda install -c anaconda scikit-learn)
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
# Prepare our data
df = df.dropna()
# Variable to model
target = df['thick']
# Predictors - remove x and y (redundant with lon lat)
# Normalize it in order to be able to compare the factors
data = df.drop(['thick', 'x', 'y'], axis=1).copy()
data_mean = data.mean()
data_std = data.std()
data = (data - data_mean) / data_std
# Use cross-validation to select the regularization parameter
lasso_cv = LassoCV(cv=5, random_state=0)
lasso_cv.fit(data.values, target.values)

We now have a statistical model trained on the full dataset. Let’s see how it compares to the calibration data:

odf = df.copy()
odf['thick_predicted'] = lasso_cv.predict(data.values)
f, ax = plt.subplots(figsize=(6, 6));
odf.plot.scatter(x='thick', y='thick_predicted', ax=ax);
plt.xlim([-25, 220]);
plt.ylim([-25, 220]);

Not so well. It is doing OK for low thicknesses, but high thicknesses are strongly underestimated. Which explanatory variables did the model pick as the most important?

predictors = pd.Series(lasso_cv.coef_, index=data.columns)
predictors

This is interesting: longitude and latitude have predictive power over thickness? Unfortunately, this is more of a coincidence than a physical reality. If we look at the correlation of the model error with the selected variables, we see that the error correlates strongly with these spatial coordinates:

odf['error'] = np.abs(odf['thick_predicted'] - odf['thick'])
odf.corr()['error'].loc[predictors.loc[predictors != 0].index]

Furthermore, the model is not very robust. Let’s use cross-validation to check our model parameters:

k_fold = KFold(4, random_state=0, shuffle=True)
for k, (train, test) in enumerate(k_fold.split(data.values, target.values)):
    lasso_cv.fit(data.values[train], target.values[train])
    print("[fold {0}] alpha: {1:.5f}, score (r^2): {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(data.values[test], target.values[test])))

The fact that the hyper-parameter alpha and the score change that much between folds is a sign that the model isn’t very robust.
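
Part of this instability is of our own making: we normalized the predictors on the full dataset before cross-validating, so each test fold has already “seen” some statistics of the training data. A leakage-free sketch (assuming scikit-learn’s standard Pipeline API; this is not part of the original workflow) would fit the scaler on the training folds only:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Fit the scaler and LassoCV on the training folds only, starting
# from the raw (unnormalized) predictors
pipe = make_pipeline(StandardScaler(), LassoCV(random_state=0))
raw = df.drop(['thick', 'x', 'y'], axis=1)
print(cross_val_score(pipe, raw.values, df['thick'].values, cv=k_fold))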

Our model is bad, but… let’s apply it#

In order to apply the model to our entire glacier, we have to get the explanatory variables from the gridded dataset again:

# Add variables we are missing
lon, lat = gdir.grid.ll_coordinates
ds['lon'] = (('y', 'x'), lon)
ds['lat'] = (('y', 'x'), lat)

# Generate our dataset
pred_data = pd.DataFrame()
for vn in data.columns:
    pred_data[vn] = ds[vn].data[ds.glacier_mask == 1]

# Normalize using the same normalization constants
pred_data = (pred_data - data_mean) / data_std

# Apply the model
pred_data['thick'] = lasso_cv.predict(pred_data.values)

# For the sake of physics, clip negative thickness values...
pred_data['thick'] = np.clip(pred_data['thick'], 0, None)
# Back to 2d and in xarray (any 2D variable works as a shape template)
var = ds[vn].data * np.nan
var[ds.glacier_mask == 1] = pred_data['thick']
ds['linear_model_thick'] = (('y', 'x'), var)
ds['linear_model_thick'].attrs['description'] = 'Predicted thickness'
ds['linear_model_thick'].attrs['units'] = 'm'
ds['linear_model_thick'].plot();

Take home points#

  • we have shown how to compute gridded attributes from OGGM glaciers such as slope, aspect, or catchments

  • we used two methods to extract these data at point locations: with interpolation or with aggregated averages on each grid point

  • as an application example, we trained a linear regression model to predict the ice thickness of this glacier at unseen locations

The model we developed was quite bad, and we used rather lousy statistics. If I had more time to make it better, I would:

  • make a pre-selection of meaningful predictors to avoid discontinuities

  • use a non-linear model

  • use cross-validation to better assess the true skill of the model (see the sketch below)
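
As a quick illustration of the last two points, here is a sketch of a random forest (untuned, and only one of many possible non-linear choices) evaluated with the same kind of cross-validation:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold

# An untuned non-linear benchmark, cross-validated as above
rf = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(4, random_state=0, shuffle=True)
print(cross_val_score(rf, data.values, target.values, cv=cv))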

Bonus: how does the statistical model compare to OGGM “out of the box”?#

# Close the dataset because we are going to write to it
ds = ds.load()
ds.close()
ds.to_netcdf(gdir.get_filepath('gridded_data'))
# Distribute thickness using default values only
workflow.execute_entity_task(tasks.distribute_thickness_per_altitude, gdirs);
# Add the linear model data for comparison
ds = xr.open_dataset(gdir.get_filepath('gridded_data'))
df_agg['oggm_thick'] = ds.distributed_thickness.isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
df_agg['linear_model_thick'] = ds.linear_model_thick.isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5));
ds['linear_model_thick'].plot(ax=ax1);
ds['distributed_thickness'].plot(ax=ax2);
plt.tight_layout();
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6));
df_agg.plot.scatter(x='thick', y='linear_model_thick', ax=ax1);
ax1.set_xlim([-25, 220]);
ax1.set_ylim([-25, 220]);
df_agg.plot.scatter(x='thick', y='oggm_thick', ax=ax2);
ax2.set_xlim([-25, 220]);
ax2.set_ylim([-25, 220]);
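
To complement the scatter plots with a single number, we can compute a simple error statistic for both thickness maps at the aggregated measurement points (a minimal sketch; RMSE is just one possible choice of metric):

# Root-mean-square error of both models at the measurement points
for col in ['linear_model_thick', 'oggm_thick']:
    rmse = np.sqrt(((df_agg[col] - df_agg['thick']) ** 2).mean())
    print('{}: RMSE = {:.1f} m'.format(col, rmse))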