A toy statistical model of ice thickness

This notebook primarily introduces the gis.gridded_attributes and gis.gridded_mb_attributes tasks, which add diagnostic variables to the glacier maps: slope, aspect, distance from the glacier border, and several mass-balance related variables that might be useful to some users.

We also show how to use these data to build a very simple (and not very good) regression model of ice thickness. This second part (“Build a statistical model”) is rather technical and not directly related to OGGM, so you might want to stop at this point unless you are interested in the topic yourself. You will need to install scikit-learn for the second part of this notebook.

Set-up

We are going to use the South Glacier example taken from the ITMIX experiment. We’ll also use the data provided with it.

## Libs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import salem

# OGGM
import oggm.cfg as cfg
from oggm import utils, workflow, tasks, graphics
# Initialize OGGM and set up the default run parameters
cfg.initialize(logging_level='WARNING')
cfg.PARAMS['use_multiprocessing'] = False
# Local working directory (where OGGM will write its output)
cfg.PATHS['working_dir'] = utils.gettempdir('OGGM_Toy_Thickness_Model')
# We start at level 3, because we need all data for the attributes
gdirs = workflow.init_glacier_directories(['RGI60-01.16195'], from_prepro_level=3, prepro_border=10)
2021-09-15 10:40:28: oggm.cfg: Reading default parameters from the OGGM `params.cfg` configuration file.
2021-09-15 10:40:28: oggm.cfg: Multiprocessing switched OFF according to the parameter file.
2021-09-15 10:40:28: oggm.cfg: Multiprocessing: using all available processors (N=2)
2021-09-15 10:40:28: oggm.workflow: init_glacier_directories from prepro level 3 on 1 glaciers.
2021-09-15 10:40:28: oggm.workflow: Execute entity task gdir_from_prepro on 1 glaciers

These are the tasks adding the attributes to the gridded file:

# Tested tasks
task_list = [
    tasks.gridded_attributes,
    tasks.gridded_mb_attributes,
]
for task in task_list:
    workflow.execute_entity_task(task, gdirs)
2021-09-15 10:40:36: oggm.workflow: Execute entity task gridded_attributes on 1 glaciers
2021-09-15 10:40:36: oggm.workflow: Execute entity task gridded_mb_attributes on 1 glaciers
# Pick our glacier
gdir = gdirs[0]

The gridded attributes

Let’s open the gridded data file with xarray:

ds = xr.open_dataset(gdir.get_filepath('gridded_data'))
# List all variables
ds
<xarray.Dataset>
Dimensions:                   (x: 105, y: 121)
Coordinates:
  * x                         (x) float32 -2.407e+03 -2.364e+03 ... 2.065e+03
  * y                         (y) float32 6.746e+06 6.745e+06 ... 6.74e+06
Data variables: (12/17)
    topo                      (y, x) float32 2.5e+03 2.505e+03 ... 2.118e+03
    topo_smoothed             (y, x) float32 2.507e+03 2.508e+03 ... 2.137e+03
    topo_valid_mask           (y, x) int8 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    glacier_mask              (y, x) int8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    glacier_ext               (y, x) int8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    glacier_ext_erosion       (y, x) int8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    ...                        ...
    catchment_area            (y, x) float32 nan nan nan nan ... nan nan nan nan
    catchment_area_on_catch   (y, x) float32 nan nan nan nan ... nan nan nan nan
    lin_mb_above_z            (y, x) float32 nan nan nan nan ... nan nan nan nan
    lin_mb_above_z_on_catch   (y, x) float32 nan nan nan nan ... nan nan nan nan
    oggm_mb_above_z           (y, x) float32 nan nan nan nan ... nan nan nan nan
    oggm_mb_above_z_on_catch  (y, x) float32 nan nan nan nan ... nan nan nan nan
Attributes:
    author:         OGGM
    author_info:    Open Global Glacier Model
    pyproj_srs:     +proj=tmerc +lat_0=0 +lon_0=-139.131 +k=0.9996 +x_0=0 +y_...
    max_h_dem:      3158.8242
    min_h_dem:      1804.8026
    max_h_glacier:  3158.8242
    min_h_glacier:  1971.5164

The file contains several variables with their description. For example:

ds.oggm_mb_above_z
<xarray.DataArray 'oggm_mb_above_z' (y: 121, x: 105)>
array([[nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
  * x        (x) float32 -2.407e+03 -2.364e+03 ... 2.022e+03 2.065e+03
  * y        (y) float32 6.746e+06 6.745e+06 6.745e+06 ... 6.74e+06 6.74e+06
Attributes:
    units:        kg/year
    long_name:    MB above point from OGGM MB model, without catchments
    description:  Mass-balance cumulated above the altitude of thepoint, henc...
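
If you want a quick overview of all variables at once, a small convenience snippet (not an OGGM function) is to loop over the dataset and print the long_name attributes:

# Print the long_name attribute of each gridded variable
for vn, da in ds.data_vars.items():
    print('{:25s} {}'.format(vn, da.attrs.get('long_name', '')))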

Let’s plot a few of them (we show how to plot them both with xarray and with OGGM):

ds.slope.plot();
plt.axis('equal');
[Figure: map of the slope variable]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
graphics.plot_raster(gdir, var_name='aspect', cmap='twilight', ax=ax1)
graphics.plot_raster(gdir, var_name='oggm_mb_above_z', ax=ax2)
[Figure: maps of the aspect and oggm_mb_above_z variables]

Retrieve these attributes at point locations

For this glacier, we have ice thickness observations (all downloaded from the supplementary material of the ITMIX paper). Let’s make a table out of it:

df = salem.read_shapefile(utils.get_demo_file('IceThick_SouthGlacier.shp'))
coords = np.array([p.xy for p in df.geometry]).squeeze()
df['lon'] = coords[:, 0]
df['lat'] = coords[:, 1]
df = df[['lon', 'lat', 'thick']]

Convert the longitudes and latitudes to the glacier map projection:

xx, yy = salem.transform_proj(salem.wgs84, gdir.grid.proj, df['lon'].values, df['lat'].values)
df['x'] = xx
df['y'] = yy

And plot these data:

geom = gdir.read_shapefile('outlines')
f, ax = plt.subplots()
df.plot.scatter(x='x', y='y', c='thick', cmap='viridis', s=10, ax=ax);
geom.plot(ax=ax, facecolor='none', edgecolor='k');
[Figure: thickness observations (colored scatter) over the glacier outline]

Method 1: interpolation

The measurement points of this dataset are very frequent and close to each other. There are plenty of them:

len(df)
9619

Here, we will keep them all and interpolate the variables of interest at the point locations. We use xarray’s vectorized (pointwise) indexing for this: the tuples ('z', df.x) below select all points along a new, shared dimension z:

vns = ['topo',
       'slope',
       'slope_factor',
       'aspect',
       'dis_from_border',
       'catchment_area',
       'lin_mb_above_z',
       'lin_mb_above_z_on_catch',
       'oggm_mb_above_z',
       'oggm_mb_above_z_on_catch',
       ]
# Interpolate (bilinear)
for vn in vns:
    df[vn] = ds[vn].interp(x=('z', df.x), y=('z', df.y))

Let’s see how these variables can explain the measured ice thickness:

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))
df.plot.scatter(x='dis_from_border', y='thick', ax=ax1);
df.plot.scatter(x='slope', y='thick', ax=ax2);
df.plot.scatter(x='oggm_mb_above_z', y='thick', ax=ax3);
[Figure: scatter plots of thick vs dis_from_border, slope and oggm_mb_above_z]

There is a negative correlation with slope (as expected), a positive correlation with the mass-flux (oggm_mb_above_z), and a slight positive correlation with the distance from the glacier boundaries.
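
To put rough numbers on these visual impressions (a quick check; the exact values depend on the input data and OGGM version), we can compute the linear correlations directly:

# Correlation of each candidate predictor with the measured thickness
df[['thick', 'slope', 'oggm_mb_above_z', 'dis_from_border']].corr()['thick']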

Method 2: aggregated per grid point

There are so many points that much of the information obtained by OGGM is interpolated. A way to deal with this is to aggregate all the measurement points per grid point and to average them. Let’s do this:

df_agg = df[['lon', 'lat', 'thick']].copy()
ii, jj = gdir.grid.transform(df['lon'], df['lat'], crs=salem.wgs84, nearest=True)
df_agg['i'] = ii
df_agg['j'] = jj
# Trick: build a common index for points that share the same (i, j) grid cell
df_agg['ij'] = ['{:04d}_{:04d}'.format(i, j) for i, j in zip(ii, jj)]
df_agg = df_agg.groupby('ij').mean()
len(df_agg)
1087
# The groupby operation above converted i and j to floats: cast them
# back to integers before using them as grid indices
df_agg['i'] = df_agg['i'].astype(int)
df_agg['j'] = df_agg['j'].astype(int)
# Select the variables at the nearest grid point (no interpolation)
for vn in vns:
    df_agg[vn] = ds[vn].isel(x=('z', df_agg.i), y=('z', df_agg.j))

We now have roughly nine times fewer points, but the main features of the data remain unchanged:

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 5))
df_agg.plot.scatter(x='dis_from_border', y='thick', ax=ax1);
df_agg.plot.scatter(x='slope', y='thick', ax=ax2);
df_agg.plot.scatter(x='oggm_mb_above_z', y='thick', ax=ax3);

Build a statistical model

Let’s use scikit-learn to build a very simple linear model of ice-thickness. First, we have to acknowledge that there is a correlation between many of the explanatory variables we will use:

import seaborn as sns
plt.figure(figsize=(10, 8));
sns.heatmap(df.corr(), cmap='RdBu');

This is a problem for linear regression models, which cannot deal well with correlated explanatory variables. We have to do a so-called “feature selection”, i.e. keep only the variables which are independently important to explain the outcome.

For the sake of simplicity, let’s use the Lasso method to help us out:

import warnings
warnings.filterwarnings("ignore")  # sklearn sends a lot of warnings
# Scikit learn (can be installed e.g. via conda install -c anaconda scikit-learn)
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
# Prepare our data
df = df.dropna()
# Variable to model
target = df['thick']
# Predictors - remove x and y (redundant with lon lat)
# Normalize it in order to be able to compare the factors
data = df.drop(['thick', 'x', 'y'], axis=1).copy()
data_mean = data.mean()
data_std = data.std()
data = (data - data_mean) / data_std
# Use cross-validation to select the regularization parameter
lasso_cv = LassoCV(cv=5, random_state=0)
lasso_cv.fit(data.values, target.values)

We now have a statistical model trained on the full dataset. Let’s see how it compares to the calibration data:

odf = df.copy()
odf['thick_predicted'] = lasso_cv.predict(data.values)
f, ax = plt.subplots(figsize=(6, 6));
odf.plot.scatter(x='thick', y='thick_predicted', ax=ax);
plt.xlim([-25, 220]);
plt.ylim([-25, 220]);
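
As a quick quantitative complement to the plot (in-sample, hence optimistic), we can print the model’s r² score on the calibration data itself:

# Coefficient of determination (r^2) on the training data
print('r^2 on calibration data: {:.2f}'.format(lasso_cv.score(data.values, target.values)))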

Not so well. It is doing OK for low thicknesses, but high thicknesses are strongly underestimated. Which explanatory variables did the model choose as being the most important?

predictors = pd.Series(lasso_cv.coef_, index=data.columns)
predictors

This is interesting. Lons and lats have predictive power over thickness? Unfortunately, this is more of a coincidence than a reality. If we look at the correlation of the error with the most important variables, we see a large correlation between the error and the spatial variables:

odf['error'] = np.abs(odf['thick_predicted'] - odf['thick'])
odf.corr()['error'].loc[predictors.loc[predictors != 0].index]

Furthermore, the model is not very robust. Let’s use cross-validation to check our model parameters:

k_fold = KFold(4, random_state=0, shuffle=True)
for k, (train, test) in enumerate(k_fold.split(data.values, target.values)):
    lasso_cv.fit(data.values[train], target.values[train])
    print("[fold {0}] alpha: {1:.5f}, score (r^2): {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(data.values[test], target.values[test])))

The fact that the hyper-parameter alpha and the score change that much between iterations is a sign that the model isn’t very robust.
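
One way to make this instability visible (a small sketch, reusing the k_fold splitter; note that it refits lasso_cv, as the loop above already did) is to look at the spread of the fitted coefficients across folds:

# Collect the Lasso coefficients fitted on each fold and inspect their spread
coefs = []
for train, test in k_fold.split(data.values, target.values):
    lasso_cv.fit(data.values[train], target.values[train])
    coefs.append(lasso_cv.coef_)
pd.DataFrame(coefs, columns=data.columns).std()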

Our model is bad, but… let’s apply it

In order to apply the model to our entire glacier, we have to get the explanatory variables from the gridded dataset again:

# Add variables we are missing
lon, lat = gdir.grid.ll_coordinates
ds['lon'] = (('y', 'x'), lon)
ds['lat'] = (('y', 'x'), lat)

# Generate our dataset
pred_data = pd.DataFrame()
for vn in data.columns:
    pred_data[vn] = ds[vn].data[ds.glacier_mask == 1]

# Normalize using the same normalization constants
pred_data = (pred_data - data_mean) / data_std

# Apply the model
pred_data['thick'] = lasso_cv.predict(pred_data.values)

# For the sake of physics, clip negative thickness values...
pred_data['thick'] = np.clip(pred_data['thick'], 0, None)
# Back to 2d and in xarray (reuse the last gridded variable as a shape template)
var = ds[vn].data * np.nan
var[ds.glacier_mask == 1] = pred_data['thick']
ds['linear_model_thick'] = (('y', 'x'), var)
ds['linear_model_thick'].attrs['description'] = 'Predicted thickness'
ds['linear_model_thick'].attrs['units'] = 'm'
ds['linear_model_thick'].plot();
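
As a quick derived number (a sketch: it assumes thickness in metres and square grid cells of side gdir.grid.dx), we can estimate the total ice volume implied by the model:

# Total volume = sum of per-cell thickness times cell area, in km^3
vol_km3 = float(ds['linear_model_thick'].sum()) * gdir.grid.dx**2 * 1e-9
print('Modelled ice volume: {:.2f} km3'.format(vol_km3))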

Take home points

  • we have shown how to compute gridded attributes from OGGM glaciers such as slope, aspect, or catchments

  • we used two methods to extract these data at point locations: with interpolation or with aggregated averages on each grid point

  • as an application example, we trained a linear regression model to predict the ice thickness of this glacier at unseen locations

The model we developed was quite bad and we used quite lousy statistics. If I had more time to make it better, I would:

  • make a pre-selection of meaningful predictors to avoid discontinuities

  • use a non-linear model (see the sketch after this list)

  • use cross-validation to better assess the true skill of the model
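
A minimal sketch of the last two ideas combined, assuming scikit-learn’s RandomForestRegressor as just one possible (untuned) non-linear model and reusing the k_fold splitter from above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Cross-validated r^2 scores of an untuned random forest on the same data
rf = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(rf, data.values, target.values, cv=k_fold)
print('Random forest r^2 per fold:', np.round(scores, 2))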


Bonus: how does the statistical model compare to OGGM “out-of the box”?

# Load the dataset in memory and close the file, because we are going to overwrite it
ds = ds.load()
ds.close()
ds.to_netcdf(gdir.get_filepath('gridded_data'))
# Distribute thickness using default values only
workflow.execute_entity_task(tasks.distribute_thickness_per_altitude, gdirs);
# Add the linear model data for comparison
ds = xr.open_dataset(gdir.get_filepath('gridded_data'))
df_agg['oggm_thick'] = ds.distributed_thickness.isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
df_agg['linear_model_thick'] = ds.linear_model_thick.isel(x=('z', df_agg['i']), y=('z', df_agg['j']))
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5));
ds['linear_model_thick'].plot(ax=ax1);
ds['distributed_thickness'].plot(ax=ax2);
plt.tight_layout();
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6));
df_agg.plot.scatter(x='thick', y='linear_model_thick', ax=ax1);
ax1.set_xlim([-25, 220]);
ax1.set_ylim([-25, 220]);
df_agg.plot.scatter(x='thick', y='oggm_thick', ax=ax2);
ax2.set_xlim([-25, 220]);
ax2.set_ylim([-25, 220]);
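
To close with a number rather than a plot (a rough comparison on the aggregated points only), we can compute the mean absolute deviation of both estimates from the observations:

# Mean absolute deviation from the observed thickness, per model
for col in ['linear_model_thick', 'oggm_thick']:
    mad = (df_agg[col] - df_agg['thick']).abs().mean()
    print('{}: mean absolute deviation = {:.1f} m'.format(col, mad))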