Preparing for ML Workflows

- Given an `xarray.Dataset` of rasters or N-D arrays, reshape them into a feature matrix for ML, retaining coordinate system metadata, and call a sequence of transformations on them.
- `MLDataset` from xarray_filters is a subclass of `xarray.Dataset` with methods for reshaping the `Dataset`'s `DataArray`s from time series, rasters, or N-D arrays into a single 2-D `DataArray` for input to statistical models.
- New methods:
  - `MLDataset.to_features`
  - `MLDataset.from_features`
  - `MLDataset.chain`
import os
import numpy as np
import xarray as xr
from xarray_filters import *
The following cell imports a function to create example `xarray_filters.MLDataset` objects.
from xarray_filters.tests.test_data import new_test_dataset
Example collection of 4-D weather arrays

Arrays with dims `(x, y, z, t)` for several state variables:
X = new_test_dataset(('pressure', 'temperature', 'wind_x', 'wind_y'))
X
Methods of `MLDataset` that are not methods of `xarray.Dataset`:
set(dir(MLDataset)) - set(dir(xr.Dataset))
Aggregating first

One option is to aggregate along one or more dims before converting to a single feature matrix:
X_means_raster = X.mean(dim=('z', 't'))
X_means_raster
xarray_filters.MLDataset.to_features

`to_features()`:

- Flattens each 4-D array of `X` to a column
- Concatenates the columns into a single `DataArray`
f = X.to_features()
f
The coordinates of the 4-D arrays are now in a `pandas.MultiIndex`.
f.space
The columns of the `features` `DataArray` are named by the `layer` that was flattened from 4-D to a 1-D column. Usage of `OrderedDict` throughout `MLDataset` internals ensures that the `layers` (`DataArray`s) always iterate in the same column order.
f.layer
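The flatten-and-concatenate step behind `to_features` can be sketched in plain numpy (a minimal model of the idea, not the library's actual implementation): each layer is raveled to a 1-D column, and the columns are stacked in the layers' insertion order.

```python
import numpy as np
from collections import OrderedDict

# Sketch: ravel each layer to a 1-D column, then stack the columns
# in the layers' insertion order, giving one row per grid point.
layers = OrderedDict([
    ('pressure', np.arange(8).reshape(2, 2, 2)),
    ('temperature', 10.0 * np.arange(8).reshape(2, 2, 2)),
])
features = np.column_stack([arr.ravel() for arr in layers.values()])
print(features.shape)  # (8, 2): one row per point, one column per layer
```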
Showing the first few `(x, y, z, t)` coordinates of the `pandas.MultiIndex` `space`:
f.space.indexes['space'].tolist()[:4]
f.space.indexes['space'].names
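The `space` row index can be approximated with `pandas.MultiIndex.from_product`, which takes the Cartesian product of each dimension's coordinate values in dimension order (the sizes below are illustrative, not taken from `X`):

```python
import numpy as np
import pandas as pd

# Sketch of the 'space' row index: the Cartesian product of each
# dimension's coordinates, in dim order (sizes here are made up).
space = pd.MultiIndex.from_product(
    [np.arange(2), np.arange(2), np.arange(2), np.arange(3)],
    names=('x', 'y', 'z', 't'))
print(list(space.names))  # ['x', 'y', 'z', 't']
print(len(space))         # 2 * 2 * 2 * 3 == 24
```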
f.features.values
It is also possible to transpose the `layers` before calling `.ravel()` on each one, via the `trans_dims` keyword to `to_features()`:
example2 = X.mean(dim='x').to_features(trans_dims=('t', 'z', 'y'))
example2
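The effect of transposing before raveling comes down to which dimension varies fastest in each resulting 1-D column; a plain numpy illustration of that ordering change (not the library's code):

```python
import numpy as np

# Transposing before ravel changes which dimension varies fastest
# in the resulting 1-D column.
arr = np.arange(6).reshape(2, 3)       # think of dims ('z', 'y')
default_order = arr.ravel()            # last dim ('y') varies fastest
swapped = arr.transpose(1, 0).ravel()  # now 'z' varies fastest
print(default_order)  # [0 1 2 3 4 5]
print(swapped)        # [0 3 1 4 2 5]
```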
data_vars_func decorator

The `data_vars_func` decorator allows writing a function that takes named `layers` as keyword or positional arguments. In the example below, it is assumed that the decorated `magnitude` function will be passed to `X.chain` in situations where `X` has `layers` named `wind_x` and `wind_y`. All other `data_vars` keys/values are passed as `other_data_vars` keyword arguments.
@data_vars_func
def magnitude(wind_x, wind_y, **other_data_vars):
    a2 = wind_x ** 2
    b2 = wind_y ** 2
    mag = (a2 + b2) ** 0.5
    return dict(magnitude=mag)
X.chain(magnitude, layers=['wind_x', 'wind_y']).to_features(features_layer='magnitude')
for_each_array decorator

`for_each_array` automates calling a function that takes a `DataArray` argument and returns a `DataArray`, applying it to each `DataArray` (`layer`) in an `MLDataset`:
@for_each_array
def plus_one(arr, **kw):
    return arr + 1
@for_each_array
def minus_one(arr, **kw):
    return arr - 1
plus = X.chain(plus_one)
minus = X.chain(minus_one)
assert np.all(plus.wind_x - minus.wind_x == 2.)
assert np.all(plus.temperature - minus.temperature == 2.)
@for_each_array
def transform_example(arr, **kw):
    up = arr.quantile(0.75, dim='z')
    low = arr.quantile(0.25, dim='z')
    median = arr.quantile(0.5, dim='z')
    return (arr - median) / (up - low)
X.chain(transform_example)
@for_each_array
def agg_example(arr, **kw):
    return arr.mean(dim='t').quantile(0.25, dim='z')
aggregated = X.chain((transform_example, agg_example))
aggregated
With `data_vars_func`-decorated functions, anything `dict`-like, an `MLDataset`, or an `xarray.Dataset` may be returned, and it will be converted to an `MLDataset`:
from collections import OrderedDict
@data_vars_func
def f(wind_x, wind_y, temperature, pressure):
    mag = (wind_x ** 2 + wind_y ** 2) ** 0.5
    return OrderedDict([('mag', mag), ('temperature', temperature), ('pressure', pressure)])
f(X)
feat = f(X).to_features()
feat
feat.features
feat.features.values
xarray_filters.MLDataset.chain

`.chain` can be called on an `MLDataset` to run callables in sequence, passing an `MLDataset` between steps.
@for_each_array
def agg_x(arr, **kw):
    return arr.mean(dim='x')
@for_each_array
def agg_y(arr, **kw):
    return arr.mean(dim='y')
@for_each_array
def agg_z(arr, **kw):
    return arr.mean(dim='z')
time_series = X.chain((agg_x, agg_y, agg_z))
time_series
time_series.to_features().features
Creating some synthetic rasters in an `MLDataset` that are similar to LANDSAT imagery with 8 spectral layers:
layers = ['layer_{}'.format(idx) for idx in range(1, 9)]
shape = (200, 200)
rand_np_arr = lambda: np.random.normal(0, 1, shape)
coords = [('x', np.arange(shape[0])), ('y', np.arange(shape[1]))]
rand_data_arr = lambda: xr.DataArray(rand_np_arr(), coords=coords, dims=('x', 'y'))
data_vars = OrderedDict([(layer, rand_data_arr()) for layer in layers])
dset = MLDataset(data_vars)
dset
Examples of chaining callables that use `for_each_array` and `data_vars_func` as decorators, where the example functions also show the variety of return types allowed in functions decorated by `data_vars_func`.

Note the `keep_arrays=True` keyword argument in the function prototypes: it means the original `layers` passed into the decorated functions will be part of the `MLDataset` outputs, even if the decorated functions do not return them.
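Conceptually, `keep_arrays=True` behaves like merging the returned layers over the input layers rather than replacing them; a rough dict-based model of that behavior (not the library's actual code):

```python
# Rough model of keep_arrays=True using plain dicts: the returned layer
# is merged over the original layers instead of replacing the whole set.
original = {'layer_4': 4.0, 'layer_5': 5.0}
returned = {'ndvi': (5.0 - 4.0) / (5.0 + 4.0)}  # what the function returns
merged = {**original, **returned}               # inputs survive in the output
print(sorted(merged))  # ['layer_4', 'layer_5', 'ndvi']
```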
from functools import partial
@for_each_array
def standardize(arr, dim=None, **kw):
    mean = arr.mean(dim=dim)
    std = arr.std(dim=dim)
    return (arr - mean) / std
@data_vars_func
def ndvi(layer_5, layer_4, keep_arrays=True):
    return OrderedDict([('ndvi', (layer_5 - layer_4) / (layer_5 + layer_4))])
@data_vars_func
def ndwi(layer_3, layer_5, keep_arrays=True, **kw):
    return {'ndwi': (layer_3 - layer_5) / (layer_3 + layer_5)}
@data_vars_func
def mndwi_36(layer_3, layer_6, keep_arrays=True):
    return xr.Dataset({'mndwi_36': (layer_3 - layer_6) / (layer_3 + layer_6)})
@data_vars_func
def mndwi_37(layer_3, layer_7, keep_arrays=True):
    return MLDataset(OrderedDict([('mndwi_37', (layer_3 - layer_7) / (layer_3 + layer_7))]))
normed_diffs = dset.chain((ndvi, ndwi, mndwi_36, mndwi_37))
standardized = dset.chain(partial(standardize, dim='x'))
normed_diffs
standardized
Merging two `MLDataset`s and converting the merged output to a 2-D features `DataArray`:
catted = normed_diffs.merge(standardized, overwrite_vars=standardized.data_vars.keys())
catted = catted.to_features()
catted.features
catted.layer
catted.from_features()
The following synthetic data example shows that the logic above can work for any number of dimensions, e.g. the 6-D `DataArray`s below:
shp = (2, 3, 4, 5, 6, 7)
dims = ('a', 'b', 'c', 'd', 'e', 'f')
coords = OrderedDict([(dim, np.arange(s)) for s, dim in zip(shp, dims)])
dset = MLDataset(OrderedDict([('layer_{}'.format(idx),
xr.DataArray(np.random.normal(0, 10, shp),
coords=coords,
dims=dims))
for idx in range(6)]))
dset
dset.layer_0.shape
With 6-D `DataArray`s, calling `to_features` creates a `pandas.MultiIndex` with 6 components:
dset.to_features()
The following cells demonstrate that `MLDataset.chain` is the same as calling `.pipe` several times in sequence.
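That equivalence can be modeled in a few lines of plain Python: each step is either a callable or a `(callable, kwargs)` pair, applied left to right just like repeated `.pipe` calls. This is a sketch of the semantics, not the library's implementation.

```python
# Sketch of chain semantics: apply steps in sequence, unpacking optional
# kwargs for each step, exactly like a series of .pipe calls.
def chain_sketch(value, steps):
    for step in steps:
        func, kwargs = step if isinstance(step, tuple) else (step, {})
        value = func(value, **kwargs)
    return value

add = lambda v, inc=0: v + inc
double = lambda v: v * 2
print(chain_sketch(1, [(add, {'inc': 10}), double]))  # (1 + 10) * 2 == 22
```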
@for_each_array
def example_agg(arr, dim=None):
    return arr.std(dim=dim)
@data_vars_func
def layers_example_with_kw(**kw):
    new = OrderedDict([('new_layer_100', kw['layer_3'] + kw['layer_4'])])
    new.update(kw)
    return MLDataset(new)
@data_vars_func
def layers_example_named_args(layer_1, layer_2, new_layer_100):
    return MLDataset(OrderedDict([('final', new_layer_100 / (layer_1 + layer_2))]))
dset.pipe(example_agg, dim='a'
    ).pipe(example_agg, dim='b'
    ).pipe(layers_example_with_kw
    ).pipe(layers_example_named_args).to_features()
dset.chain([(example_agg, dict(dim='a')),
(example_agg, dict(dim='b')),
layers_example_with_kw,
layers_example_named_args,
]).to_features()
flattened = dset.chain([(example_agg, dict(dim='a')),
(example_agg, dict(dim='b')),
layers_example_with_kw,
layers_example_named_args,
]).to_features()
flattened.features.values[0:5, 0] = np.nan
flattened.layer