Preparing for ML Workflows

  • Given an xarray.Dataset of rasters or N-D arrays, reshape them into a feature matrix for ML while retaining coordinate system metadata, and call a sequence of transformations on them.
  • MLDataset from xarray_filters is a subclass of xarray.Dataset with methods for reshaping the Dataset's DataArrays from time series, rasters, or N-D arrays into a single 2-D DataArray for input to statistical models.
  • New methods:
    • MLDataset.to_features
    • MLDataset.from_features
    • MLDataset.chain
In [1]:
import os

import numpy as np
import xarray as xr
from xarray_filters import *

The following cell imports a function to create example xarray_filters.MLDataset objects.

In [2]:
from xarray_filters.tests.test_data import new_test_dataset

Example collection of 4-D weather arrays

(x, y, z, t) for several state variables

In [3]:
X = new_test_dataset(('pressure', 'temperature', 'wind_x', 'wind_y'))
In [4]:
X
Out[4]:
<xarray.MLDataset>
Dimensions:      (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z            (z) int64 0 1 2 3 4 5 6 7
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    pressure     (x, y, z, t) float64 0.943 0.8551 0.8603 0.1391 0.1404 ...
    temperature  (x, y, z, t) float64 0.5724 0.5373 0.3906 0.4199 0.4054 ...
    wind_x       (x, y, z, t) float64 0.8993 0.1015 0.3021 0.2888 0.9948 ...
    wind_y       (x, y, z, t) float64 0.421 0.548 0.129 0.8564 0.3161 0.6974 ...

Methods of MLDataset that are not methods of xarray.Dataset:

In [5]:
set(dir(MLDataset)) - set(dir(xr.Dataset))
Out[5]:
{'_guess_yname',
 'chain',
 'concat_ml_features',
 'from_features',
 'has_features',
 'to_dataset',
 'to_features',
 'to_mldataset',
 'to_xy_arrays'}

Aggregating first

One option is to aggregate along one or more dims before converting to a single feature matrix:

In [6]:
X_means_raster = X.mean(dim=('z', 't'))
X_means_raster
Out[6]:
<xarray.MLDataset>
Dimensions:      (x: 20, y: 15)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Data variables:
    pressure     (x, y) float64 0.508 0.5191 0.5156 0.5092 0.5189 0.5338 ...
    temperature  (x, y) float64 0.5052 0.5113 0.4797 0.4966 0.4957 0.4917 ...
    wind_x       (x, y) float64 0.4842 0.4966 0.5015 0.4939 0.5007 0.4915 ...
    wind_y       (x, y) float64 0.5257 0.4968 0.4746 0.509 0.459 0.5048 ...

xarray_filters.MLDataset.to_features

to_features()

  • Flattens each 4-D array of X to a column
  • Concatenates the columns into a single 2-D DataArray named features
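
The two steps above can be sketched with plain NumPy. This is only an illustration of the reshaping idea, assuming same-shaped arrays; it is not the xarray_filters implementation, and the random data and the `layers` mapping are made up for the example.

```python
from collections import OrderedDict

import numpy as np

# Hypothetical stand-in for a dataset: same-shaped 4-D arrays keyed by layer name.
rng = np.random.default_rng(0)
shape = (20, 15, 8, 48)  # (x, y, z, t), the sizes used in the example dataset
layers = OrderedDict(
    (name, rng.random(shape))
    for name in ("pressure", "temperature", "wind_x", "wind_y")
)

# Step 1: flatten each N-D array to a 1-D column (C order, last dim varies fastest).
columns = [arr.ravel() for arr in layers.values()]

# Step 2: concatenate the columns into one 2-D (samples, layers) feature matrix.
features = np.stack(columns, axis=1)

print(features.shape)  # (115200, 4)
```

Unlike this sketch, the MLDataset version keeps the coordinates of every row and the name of every column.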
In [7]:
f = X.to_features()
f
Out[7]:
<xarray.MLDataset>
Dimensions:   (layer: 4, space: 115200)
Coordinates:
  * space     (space) MultiIndex
  - x         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t         (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer     (layer) object 'pressure' 'temperature' 'wind_x' 'wind_y'
Data variables:
    features  (space, layer) float64 0.943 0.5724 0.8993 0.421 0.8551 0.5373 ...

The coordinates of the 4-D arrays are now in a pandas.MultiIndex.

In [8]:
f.space
Out[8]:
<xarray.DataArray 'space' (space: 115200)>
array([(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), ..., (19, 14, 7, 45),
       (19, 14, 7, 46), (19, 14, 7, 47)], dtype=object)
Coordinates:
  * space    (space) MultiIndex
  - x        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t        (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...

The columns of the features DataArray are named by the layer that was flattened from 4-D to a 1-D column. MLDataset uses OrderedDict throughout its internals, which ensures that the layers (DataArrays) always iterate in the same column order.

In [9]:
f.layer
Out[9]:
<xarray.DataArray 'layer' (layer: 4)>
array(['pressure', 'temperature', 'wind_x', 'wind_y'], dtype=object)
Coordinates:
  * layer    (layer) object 'pressure' 'temperature' 'wind_x' 'wind_y'

Showing the first few (x, y, z, t) coordinates of the pandas.MultiIndex space:

In [10]:
f.space.indexes['space'].tolist()[:4]
Out[10]:
[(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), (0, 0, 0, 3)]
In [11]:
f.space.indexes['space'].names
Out[11]:
FrozenList(['x', 'y', 'z', 't'])
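
The relationship between a row of the feature matrix and its (x, y, z, t) coordinate can be reproduced with NumPy alone. The sketch below assumes the arrays were flattened in C order (last dimension varying fastest), which is consistent with the index values shown above; `row_to_coords` is a hypothetical helper, not part of xarray_filters.

```python
import itertools

import numpy as np

shape = (20, 15, 8, 48)  # sizes of the x, y, z, t dimensions
n_rows = int(np.prod(shape))


def row_to_coords(row):
    """Map a row number of the flattened array back to its (x, y, z, t) position."""
    return tuple(int(i) for i in np.unravel_index(row, shape))


print(row_to_coords(0))           # (0, 0, 0, 0)
print(row_to_coords(1))           # (0, 0, 0, 1)
print(row_to_coords(n_rows - 1))  # (19, 14, 7, 47)

# The same ordering is the Cartesian product of the coordinate ranges.
head = list(itertools.islice(itertools.product(*(range(n) for n in shape)), 4))
print(head)  # [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 2), (0, 0, 0, 3)]
```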
In [12]:
f.features.values
Out[12]:
array([[ 0.94304574,  0.57237128,  0.89926858,  0.42103063],
       [ 0.8550843 ,  0.53726692,  0.10153233,  0.54802044],
       [ 0.86031506,  0.3906344 ,  0.30205406,  0.12903882],
       ..., 
       [ 0.41045238,  0.74009848,  0.78442477,  0.30888828],
       [ 0.94600064,  0.00691722,  0.10819362,  0.57415637],
       [ 0.76668616,  0.65057938,  0.69437022,  0.97104589]])

It is also possible to transpose the layers before .ravel() is called on each one, using the trans_dims keyword of to_features():

In [13]:
example2 = X.mean(dim='x').to_features(trans_dims=('t', 'z', 'y'))
example2
Out[13]:
<xarray.MLDataset>
Dimensions:   (layer: 4, space: 5760)
Coordinates:
  * space     (space) MultiIndex
  - t         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 ...
  - y         (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 1 2 3 4 5 6 ...
  * layer     (layer) object 'pressure' 'temperature' 'wind_x' 'wind_y'
Data variables:
    features  (space, layer) float64 0.5578 0.5186 0.6317 0.4923 0.5377 ...
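
The effect of transposing before flattening can be seen on a small NumPy array. This is a sketch with made-up sizes, assuming an array whose dims are (y, z, t); it only illustrates why the row order changes and does not call xarray_filters.

```python
import numpy as np

# Hypothetical 3-D array with dims (y, z, t) and small sizes for readability.
arr = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Flattening as-is: the last dim (t) varies fastest.
default_order = arr.ravel()

# Transposing to (t, z, y) first makes y vary fastest instead.
new_order = arr.transpose(2, 1, 0).ravel()

print(default_order[:4].tolist())  # [0, 1, 2, 3]
print(new_order[:4].tolist())      # [0, 12, 4, 16]
```

Both results hold the same values; only the order of the rows, and therefore the order of the index levels, differs.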

data_vars_func decorator

The data_vars_func decorator allows writing a function that takes named layers as keyword or positional arguments. In the example below, it is assumed that the decorated magnitude function will be passed to X.chain in situations where X has layers named wind_x and wind_y. All other data_vars keys/values are passed through the other_data_vars keyword arguments.

In [14]:
@data_vars_func
def magnitude(wind_x, wind_y, **other_data_vars):
    a2 = wind_x ** 2
    b2 = wind_y ** 2
    mag = (a2 + b2) ** 0.5
    return dict(magnitude=mag)
X.chain(magnitude, layers=['wind_x', 'wind_y']).to_features(features_layer='magnitude')
/home/travis/miniconda/envs/xarray_filters_py36/lib/python3.6/site-packages/xarray/core/dataset.py:378: FutureWarning: iteration over an xarray.Dataset will change in xarray v0.11 to only include data variables, not coordinates. Iterate over the Dataset.variables property instead to preserve existing behavior in a forwards compatible manner.
  both_data_and_coords = [k for k in data_vars if k in coords]
Out[14]:
<xarray.MLDataset>
Dimensions:    (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x          (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y          (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z          (z) int64 0 1 2 3 4 5 6 7
  * t          (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    magnitude  (x, y, z, t) float64 0.993 0.5573 0.3285 0.9038 1.044 0.8807 ...
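
The argument unpacking described for data_vars_func can be imitated in plain Python. The decorator below, `data_vars_sketch`, is a hypothetical stand-in written for this illustration; it treats a dataset as a simple mapping of layer names to values and is not how xarray_filters implements data_vars_func.

```python
import functools
import inspect


def data_vars_sketch(func):
    """Hypothetical decorator: unpack a mapping of layers into named arguments."""
    params = inspect.signature(func).parameters
    accepts_extra = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )

    @functools.wraps(func)
    def wrapper(dset):
        if accepts_extra:
            # A **kwargs parameter receives every layer that is not named explicitly.
            kwargs = dict(dset)
        else:
            # Otherwise pass only the layers the function names.
            kwargs = {k: v for k, v in dset.items() if k in params}
        return dict(func(**kwargs))

    return wrapper


@data_vars_sketch
def magnitude(wind_x, wind_y, **other_data_vars):
    return {"magnitude": (wind_x ** 2 + wind_y ** 2) ** 0.5}


result = magnitude({"wind_x": 3.0, "wind_y": 4.0, "pressure": 1.0})
print(result)  # {'magnitude': 5.0}
```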

for_each_array decorator

for_each_array automates calling a function that takes a DataArray argument and returns a DataArray, once for each DataArray (layer) in an MLDataset:

In [15]:
@for_each_array
def plus_one(arr, **kw):
    return arr + 1

@for_each_array
def minus_one(arr, **kw):
    return arr - 1


plus = X.chain(plus_one)
minus = X.chain(minus_one)

assert np.all(plus.wind_x - minus.wind_x == 2.)
assert np.all(plus.temperature - minus.temperature == 2.)
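
As a rough mental model, a decorator of this kind maps one function over every layer and collects the results under the same names. The `for_each_sketch` decorator below is hypothetical and works on an ordinary OrderedDict; it is meant to show the pattern, not the xarray_filters internals.

```python
import functools
from collections import OrderedDict


def for_each_sketch(func):
    """Hypothetical decorator: apply `func` to every value of a mapping."""

    @functools.wraps(func)
    def wrapper(dset, **kw):
        # Preserve layer order so the output columns line up with the input.
        return OrderedDict((name, func(arr, **kw)) for name, arr in dset.items())

    return wrapper


@for_each_sketch
def plus_one_sketch(arr, **kw):
    return arr + 1


out = plus_one_sketch(OrderedDict([("wind_x", 1.0), ("wind_y", 2.5)]))
print(dict(out))  # {'wind_x': 2.0, 'wind_y': 3.5}
```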
In [16]:
@for_each_array
def transform_example(arr, **kw):
    up = arr.quantile(0.75, dim='z')
    low = arr.quantile(0.25, dim='z')
    median = arr.quantile(0.5, dim='z')
    return (arr - median) / (up - low)

X.chain(transform_example)
Out[16]:
<xarray.MLDataset>
Dimensions:      (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z            (z) int64 0 1 2 3 4 5 6 7
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
    quantile     float64 0.5
Data variables:
    pressure     (x, y, z, t) float64 0.5874 0.4168 1.752 -0.5453 -0.6057 ...
    temperature  (x, y, z, t) float64 0.0413 -0.1025 -0.4854 -0.1276 0.07339 ...
    wind_x       (x, y, z, t) float64 0.7737 -1.761 -0.8992 -0.4616 1.172 ...
    wind_y       (x, y, z, t) float64 0.3701 0.129 -1.655 1.048 -1.106 ...
In [17]:
@for_each_array
def agg_example(arr, **kw):
    return arr.mean(dim='t').quantile(0.25, dim='z')

aggregated = X.chain((transform_example, agg_example))
In [18]:
aggregated
Out[18]:
<xarray.MLDataset>
Dimensions:      (x: 20, y: 15)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
    quantile     float64 0.25
Data variables:
    pressure     (x, y) float64 -0.01103 -0.04669 -0.05397 -0.04531 -0.1158 ...
    temperature  (x, y) float64 0.02512 -0.06629 -0.04142 -0.01822 -0.04365 ...
    wind_x       (x, y) float64 -0.104 -0.03628 -0.09501 -0.04018 -0.04926 ...
    wind_y       (x, y) float64 -0.09337 -0.06528 0.04342 -0.08975 ...

With data_vars_func decorated functions, anything dict-like (including an MLDataset or xarray.Dataset) may be returned, and it will be converted to an MLDataset:

In [19]:
from collections import OrderedDict
@data_vars_func
def f(wind_x, wind_y, temperature, pressure):
    mag = (wind_x ** 2 + wind_y ** 2) ** 0.5
    return OrderedDict([('mag', mag), ('temperature', temperature), ('pressure', pressure)])

f(X)
Out[19]:
<xarray.MLDataset>
Dimensions:      (t: 48, x: 20, y: 15, z: 8)
Coordinates:
  * x            (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * y            (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  * z            (z) int64 0 1 2 3 4 5 6 7
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    mag          (x, y, z, t) float64 0.993 0.5573 0.3285 0.9038 1.044 ...
    temperature  (x, y, z, t) float64 0.5724 0.5373 0.3906 0.4199 0.4054 ...
    pressure     (x, y, z, t) float64 0.943 0.8551 0.8603 0.1391 0.1404 ...
In [20]:
feat = f(X).to_features()
feat
Out[20]:
<xarray.MLDataset>
Dimensions:   (layer: 3, space: 115200)
Coordinates:
  * space     (space) MultiIndex
  - x         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t         (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
  * layer     (layer) object 'mag' 'temperature' 'pressure'
Data variables:
    features  (space, layer) float64 0.993 0.5724 0.943 0.5573 0.5373 0.8551 ...
In [21]:
feat.features
Out[21]:
<xarray.DataArray 'features' (space: 115200, layer: 3)>
array([[ 0.992951,  0.572371,  0.943046],
       [ 0.557347,  0.537267,  0.855084],
       [ 0.328463,  0.390634,  0.860315],
       ..., 
       [ 0.843051,  0.740098,  0.410452],
       [ 0.584261,  0.006917,  0.946001],
       [ 1.193767,  0.650579,  0.766686]])
Coordinates:
  * space    (space) MultiIndex
  - x        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - z        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - t        (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * layer    (layer) object 'mag' 'temperature' 'pressure'
In [22]:
feat.features.values
Out[22]:
array([[ 0.99295053,  0.57237128,  0.94304574],
       [ 0.55734658,  0.53726692,  0.8550843 ],
       [ 0.32846259,  0.3906344 ,  0.86031506],
       ..., 
       [ 0.84305053,  0.74009848,  0.41045238],
       [ 0.58426141,  0.00691722,  0.94600064],
       [ 1.19376719,  0.65057938,  0.76668616]])

xarray_filters.MLDataset.chain

.chain can be called on an MLDataset to run callables in sequence, passing an MLDataset between steps.
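
Running callables in sequence is essentially a left-to-right fold. The sketch below uses a hypothetical `chain_sketch` function and a plain NumPy array in place of an MLDataset, so it shows only the control flow, not the library's behavior.

```python
from functools import partial, reduce

import numpy as np


def chain_sketch(data, steps):
    """Feed `data` through each callable in `steps`, left to right."""
    return reduce(lambda current, step: step(current), steps, data)


# Averaging over the leading axis three times removes x, then y, then z.
mean_first_axis = partial(np.mean, axis=0)
cube = np.ones((20, 15, 8, 48))  # hypothetical (x, y, z, t) array
series = chain_sketch(cube, (mean_first_axis,) * 3)

print(series.shape)  # (48,)
```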

In [23]:
@for_each_array
def agg_x(arr, **kw):
    return arr.mean(dim='x')

@for_each_array
def agg_y(arr, **kw):
    return arr.mean(dim='y')

@for_each_array
def agg_z(arr, **kw):
    return arr.mean(dim='z')


time_series = X.chain((agg_x, agg_y, agg_z))
time_series
Out[23]:
<xarray.MLDataset>
Dimensions:      (t: 48)
Coordinates:
  * t            (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    pressure     (t) float64 0.508 0.4951 0.4975 0.5013 0.5014 0.5066 0.4998 ...
    temperature  (t) float64 0.5044 0.5052 0.5007 0.5057 0.4996 0.5017 ...
    wind_x       (t) float64 0.4913 0.5015 0.5028 0.4994 0.5049 0.4974 ...
    wind_y       (t) float64 0.5018 0.4952 0.5008 0.5008 0.5029 0.5041 ...
In [24]:
time_series.to_features().features
Out[24]:
<xarray.DataArray 'features' (t: 48, layer: 4)>
array([[ 0.508011,  0.50441 ,  0.491325,  0.501791],
       [ 0.495105,  0.505195,  0.501508,  0.495244],
       [ 0.497456,  0.500709,  0.50275 ,  0.500827],
       [ 0.501316,  0.505695,  0.499432,  0.500814],
       [ 0.501423,  0.499579,  0.504869,  0.502901],
       [ 0.506578,  0.501706,  0.497429,  0.504051],
       [ 0.499826,  0.5057  ,  0.496623,  0.504154],
       [ 0.497774,  0.494554,  0.506317,  0.49565 ],
       [ 0.49754 ,  0.502164,  0.503356,  0.489029],
       [ 0.500576,  0.501813,  0.501686,  0.503828],
       [ 0.506306,  0.501103,  0.492398,  0.50422 ],
       [ 0.494418,  0.49479 ,  0.500274,  0.496992],
       [ 0.509054,  0.498496,  0.507002,  0.502409],
       [ 0.498281,  0.483517,  0.491937,  0.496341],
       [ 0.493782,  0.498443,  0.496371,  0.497739],
       [ 0.506201,  0.493848,  0.503532,  0.501417],
       [ 0.490193,  0.492893,  0.49947 ,  0.507501],
       [ 0.493411,  0.506393,  0.499553,  0.489611],
       [ 0.491931,  0.503453,  0.509046,  0.497308],
       [ 0.500486,  0.50004 ,  0.504978,  0.500159],
       [ 0.502545,  0.506109,  0.501411,  0.497691],
       [ 0.505572,  0.508179,  0.502869,  0.497524],
       [ 0.493383,  0.499389,  0.504163,  0.50949 ],
       [ 0.509835,  0.505621,  0.488983,  0.494115],
       [ 0.51022 ,  0.500319,  0.496296,  0.504541],
       [ 0.49236 ,  0.485227,  0.501256,  0.502297],
       [ 0.496386,  0.501631,  0.493982,  0.489832],
       [ 0.486673,  0.496127,  0.496505,  0.50021 ],
       [ 0.514912,  0.493692,  0.499438,  0.50293 ],
       [ 0.505792,  0.496942,  0.502634,  0.500891],
       [ 0.498338,  0.516481,  0.504869,  0.495852],
       [ 0.493812,  0.496064,  0.504753,  0.494797],
       [ 0.500156,  0.492373,  0.493381,  0.514306],
       [ 0.491772,  0.504517,  0.485997,  0.501903],
       [ 0.489161,  0.498377,  0.493417,  0.499951],
       [ 0.501926,  0.508538,  0.506648,  0.502152],
       [ 0.512457,  0.494537,  0.490528,  0.498282],
       [ 0.49954 ,  0.500534,  0.507356,  0.499091],
       [ 0.498592,  0.497729,  0.497631,  0.496413],
       [ 0.507836,  0.491338,  0.496549,  0.493771],
       [ 0.500518,  0.500438,  0.50249 ,  0.500901],
       [ 0.49996 ,  0.503797,  0.496763,  0.504346],
       [ 0.504079,  0.506028,  0.498611,  0.495113],
       [ 0.497707,  0.508452,  0.504747,  0.504746],
       [ 0.510344,  0.499445,  0.498672,  0.498012],
       [ 0.501891,  0.492353,  0.496031,  0.489664],
       [ 0.505161,  0.503386,  0.503797,  0.50416 ],
       [ 0.501192,  0.503771,  0.498621,  0.513291]])
Coordinates:
  * t        (t) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * layer    (layer) object 'pressure' 'temperature' 'wind_x' 'wind_y'

Creating some synthetic rasters in an MLDataset, loosely resembling Landsat imagery with 8 spectral layers:

In [25]:
layers = ['layer_{}'.format(idx) for idx in range(1, 9)]
shape = (200, 200)
rand_np_arr = lambda: np.random.normal(0, 1, shape)
coords = [('x', np.arange(shape[0])), ('y', np.arange(shape[1]))]
rand_data_arr = lambda: xr.DataArray(rand_np_arr(), coords=coords, dims=('x', 'y'))
data_vars = OrderedDict([(layer, rand_data_arr()) for layer in layers])
dset = MLDataset(data_vars)
dset
Out[25]:
<xarray.MLDataset>
Dimensions:  (x: 200, y: 200)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    layer_1  (x, y) float64 0.2057 -1.15 1.308 0.6295 0.9121 -1.244 -0.7768 ...
    layer_2  (x, y) float64 -0.3386 1.763 1.673 1.176 -1.307 -1.33 -1.343 ...
    layer_3  (x, y) float64 -0.9408 -1.036 1.102 0.4921 -0.516 -0.5561 ...
    layer_4  (x, y) float64 -1.742 0.6987 0.9873 -0.02936 -1.073 -0.1992 ...
    layer_5  (x, y) float64 -1.049 0.6421 -1.749 1.623 0.5875 -1.782 -1.646 ...
    layer_6  (x, y) float64 -1.397 -0.3429 -0.101 0.0783 0.07475 -0.1138 ...
    layer_7  (x, y) float64 -2.528 -1.201 -0.875 0.8027 -0.5513 -0.1087 ...
    layer_8  (x, y) float64 -0.6376 -0.05374 -0.2068 1.514 0.354 -0.01112 ...

The examples below chain callables decorated with for_each_array and data_vars_func. They also show the variety of return data types allowed in functions decorated by data_vars_func.

Note the keep_arrays=True keyword argument in the function signatures: it means that the original layers passed into the decorated functions will be part of the MLDataset outputs, even if the decorated functions do not return them.

In [26]:
from functools import partial
@for_each_array
def standardize(arr, dim=None, **kw):
    mean = arr.mean(dim=dim)
    std = arr.std(dim=dim)
    return (arr - mean) / std

@data_vars_func
def ndvi(layer_5, layer_4, keep_arrays=True):
    return OrderedDict([('ndvi', (layer_5 - layer_4) / (layer_5 + layer_4))])


@data_vars_func
def ndwi(layer_3, layer_5, keep_arrays=True, **kw):
    return {'ndwi': (layer_3 - layer_5) / (layer_3 + layer_5)}


@data_vars_func
def mndwi_36(layer_3, layer_6, keep_arrays=True):
    return xr.Dataset({'mndwi_36': (layer_3 - layer_6) / (layer_3 + layer_6)})


@data_vars_func
def mndwi_37(layer_3, layer_7, keep_arrays=True):
    return MLDataset(OrderedDict([('mndwi_37', (layer_3 - layer_7) / (layer_3 + layer_7))]))

normed_diffs = dset.chain((ndvi, ndwi, mndwi_36, mndwi_37))
standardized = dset.chain(partial(standardize, dim='x'))
In [27]:
normed_diffs
Out[27]:
<xarray.MLDataset>
Dimensions:   (x: 200, y: 200)
Coordinates:
  * x         (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y         (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    mndwi_37  (x, y) float64 -0.4575 -0.07409 8.71 -0.2399 -0.03307 0.673 ...
    mndwi_36  (x, y) float64 -0.1951 0.5026 1.202 0.7255 1.339 0.6604 ...
    ndwi      (x, y) float64 -0.05435 4.263 -4.403 -0.5347 -15.43 -0.5244 ...
    ndvi      (x, y) float64 -0.2483 -0.04214 3.591 1.037 -3.421 0.799 ...
    layer_1   (x, y) float64 0.2057 -1.15 1.308 0.6295 0.9121 -1.244 -0.7768 ...
    layer_2   (x, y) float64 -0.3386 1.763 1.673 1.176 -1.307 -1.33 -1.343 ...
    layer_3   (x, y) float64 -0.9408 -1.036 1.102 0.4921 -0.516 -0.5561 ...
    layer_4   (x, y) float64 -1.742 0.6987 0.9873 -0.02936 -1.073 -0.1992 ...
    layer_5   (x, y) float64 -1.049 0.6421 -1.749 1.623 0.5875 -1.782 -1.646 ...
    layer_6   (x, y) float64 -1.397 -0.3429 -0.101 0.0783 0.07475 -0.1138 ...
    layer_7   (x, y) float64 -2.528 -1.201 -0.875 0.8027 -0.5513 -0.1087 ...
    layer_8   (x, y) float64 -0.6376 -0.05374 -0.2068 1.514 0.354 -0.01112 ...
In [28]:
standardized
Out[28]:
<xarray.MLDataset>
Dimensions:  (x: 200, y: 200)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    layer_1  (x, y) float64 0.1133 -0.9913 1.208 0.5796 0.9299 -1.281 ...
    layer_2  (x, y) float64 -0.2191 1.937 1.725 1.348 -1.252 -1.393 -1.192 ...
    layer_3  (x, y) float64 -0.9028 -1.068 1.224 0.5139 -0.3531 -0.6708 ...
    layer_4  (x, y) float64 -1.582 0.7389 1.006 0.0176 -1.053 -0.2102 ...
    layer_5  (x, y) float64 -1.131 0.5838 -1.931 1.649 0.6675 -1.886 -1.708 ...
    layer_6  (x, y) float64 -1.215 -0.3089 -0.1669 0.1324 0.1582 -0.1556 ...
    layer_7  (x, y) float64 -2.392 -1.141 -0.745 0.6597 -0.6249 0.006187 ...
    layer_8  (x, y) float64 -0.4762 -0.1219 -0.1356 1.557 0.3148 -0.132 ...

Merging two MLDatasets and converting the merged output to a 2-D features DataArray:

In [29]:
catted = normed_diffs.merge(standardized, overwrite_vars=standardized.data_vars.keys())
catted = catted.to_features()
/home/travis/miniconda/envs/xarray_filters_py36/lib/python3.6/site-packages/xarray/core/merge.py:533: FutureWarning: iteration over an xarray.Dataset will change in xarray v0.11 to only include data variables, not coordinates. Iterate over the Dataset.variables property instead to preserve existing behavior in a forwards compatible manner.
  elif overwrite_vars == set(other):
/home/travis/miniconda/envs/xarray_filters_py36/lib/python3.6/_collections_abc.py:743: FutureWarning: iteration over an xarray.Dataset will change in xarray v0.11 to only include data variables, not coordinates. Iterate over the Dataset.variables property instead to preserve existing behavior in a forwards compatible manner.
  for key in self._mapping:
In [30]:
catted.features
Out[30]:
<xarray.DataArray 'features' (space: 40000, layer: 12)>
array([[-0.45752 , -0.195089, -0.054352, ..., -1.214781, -2.391996, -0.476214],
       [-0.074091,  0.502572,  4.263453, ..., -0.30895 , -1.141291, -0.121904],
       [ 8.709869,  1.201793, -4.403228, ..., -0.166914, -0.745047, -0.135598],
       ..., 
       [-0.379463, -0.696751, -0.391354, ..., -0.697039, -0.374042,  0.890838],
       [-0.279059, -8.709046,  1.783529, ..., -1.063337,  1.727148, -0.778712],
       [-0.245155, -5.143229,  0.257451, ..., -1.842043,  2.277689, -0.09115 ]])
Coordinates:
  * space    (space) MultiIndex
  - x        (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - y        (space) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
  * layer    (layer) object 'mndwi_37' 'mndwi_36' 'ndwi' 'ndvi' 'layer_1' ...
In [31]:
catted.layer
Out[31]:
<xarray.DataArray 'layer' (layer: 12)>
array(['mndwi_37', 'mndwi_36', 'ndwi', 'ndvi', 'layer_1', 'layer_2', 'layer_3',
       'layer_4', 'layer_5', 'layer_6', 'layer_7', 'layer_8'], dtype=object)
Coordinates:
  * layer    (layer) object 'mndwi_37' 'mndwi_36' 'ndwi' 'ndvi' 'layer_1' ...
In [32]:
catted.from_features()
Out[32]:
<xarray.MLDataset>
Dimensions:   (x: 200, y: 200)
Coordinates:
  * x         (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y         (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    mndwi_37  (x, y) float64 -0.4575 -0.07409 8.71 -0.2399 -0.03307 0.673 ...
    mndwi_36  (x, y) float64 -0.1951 0.5026 1.202 0.7255 1.339 0.6604 ...
    ndwi      (x, y) float64 -0.05435 4.263 -4.403 -0.5347 -15.43 -0.5244 ...
    ndvi      (x, y) float64 -0.2483 -0.04214 3.591 1.037 -3.421 0.799 ...
    layer_1   (x, y) float64 0.1133 -0.9913 1.208 0.5796 0.9299 -1.281 ...
    layer_2   (x, y) float64 -0.2191 1.937 1.725 1.348 -1.252 -1.393 -1.192 ...
    layer_3   (x, y) float64 -0.9028 -1.068 1.224 0.5139 -0.3531 -0.6708 ...
    layer_4   (x, y) float64 -1.582 0.7389 1.006 0.0176 -1.053 -0.2102 ...
    layer_5   (x, y) float64 -1.131 0.5838 -1.931 1.649 0.6675 -1.886 -1.708 ...
    layer_6   (x, y) float64 -1.215 -0.3089 -0.1669 0.1324 0.1582 -0.1556 ...
    layer_7   (x, y) float64 -2.392 -1.141 -0.745 0.6597 -0.6249 0.006187 ...
    layer_8   (x, y) float64 -0.4762 -0.1219 -0.1356 1.557 0.3148 -0.132 ...

The following synthetic data example shows that the logic used above in this notebook works for any number of dimensions, e.g. the 6-D DataArrays below:

In [33]:
shp = (2, 3, 4, 5, 6, 7)
dims = ('a', 'b', 'c', 'd', 'e', 'f')
coords = OrderedDict([(dim, np.arange(s)) for s, dim in zip(shp, dims)])
dset = MLDataset(OrderedDict([('layer_{}'.format(idx), 
                               xr.DataArray(np.random.normal(0, 10, shp),
                                            coords=coords,
                                            dims=dims)) 
                              for idx in range(6)]))
dset
Out[33]:
<xarray.MLDataset>
Dimensions:  (a: 2, b: 3, c: 4, d: 5, e: 6, f: 7)
Coordinates:
  * a        (a) int64 0 1
  * b        (b) int64 0 1 2
  * c        (c) int64 0 1 2 3
  * d        (d) int64 0 1 2 3 4
  * e        (e) int64 0 1 2 3 4 5
  * f        (f) int64 0 1 2 3 4 5 6
Data variables:
    layer_0  (a, b, c, d, e, f) float64 4.86 8.261 3.274 -10.64 -1.401 ...
    layer_1  (a, b, c, d, e, f) float64 -2.849 11.3 -1.716 10.55 14.69 ...
    layer_2  (a, b, c, d, e, f) float64 -0.9942 -10.1 -20.37 -5.836 -1.135 ...
    layer_3  (a, b, c, d, e, f) float64 4.382 -3.699 4.419 -2.466 -9.8 3.299 ...
    layer_4  (a, b, c, d, e, f) float64 -6.235 -6.037 14.62 0.3124 -7.587 ...
    layer_5  (a, b, c, d, e, f) float64 -22.43 2.969 -3.328 5.508 -6.908 ...
In [34]:
dset.layer_0.shape
Out[34]:
(2, 3, 4, 5, 6, 7)

With 6-D DataArrays, calling to_features creates a pandas.MultiIndex with 6 components:

In [35]:
dset.to_features()
Out[35]:
<xarray.MLDataset>
Dimensions:   (layer: 6, space: 5040)
Coordinates:
  * space     (space) MultiIndex
  - a         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - b         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - c         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - d         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - e         (space) int64 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 ...
  - f         (space) int64 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 ...
  * layer     (layer) object 'layer_0' 'layer_1' 'layer_2' 'layer_3' ...
Data variables:
    features  (space, layer) float64 4.86 -2.849 -0.9942 4.382 -6.235 -22.43 ...
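
The row count of 5040 is the product of the six dimension sizes, and the flattening is lossless as long as the original shape is known. The NumPy sketch below, which assumes C-order flattening and uses made-up random data, checks both points; the final reshape is roughly what an inverse such as from_features has to undo, although the library works from the stored MultiIndex rather than a bare shape.

```python
import numpy as np

shp = (2, 3, 4, 5, 6, 7)
rng = np.random.default_rng(0)
arrays = [rng.normal(0, 10, shp) for _ in range(6)]  # six hypothetical layers

# Flatten each 6-D array to a column and stack the columns side by side.
features = np.stack([a.ravel() for a in arrays], axis=1)
print(features.shape)  # (5040, 6)

# Reshaping each column with the original shape recovers the N-D arrays.
restored = [features[:, i].reshape(shp) for i in range(features.shape[1])]
round_trip_ok = all(np.array_equal(a, r) for a, r in zip(arrays, restored))
print(round_trip_ok)  # True
```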

The following cells demonstrate that MLDataset.chain is equivalent to calling .pipe several times in sequence.

In [36]:
@for_each_array
def example_agg(arr, dim=None):
    return arr.std(dim=dim)

@data_vars_func
def layers_example_with_kw(**kw):
    new = OrderedDict([('new_layer_100', kw['layer_3'] + kw['layer_4'])])
    new.update(kw)
    return MLDataset(new)

@data_vars_func
def layers_example_named_args(layer_1, layer_2, new_layer_100):
    return MLDataset(OrderedDict([('final', new_layer_100 / (layer_1 + layer_2))]))
In [37]:
dset.pipe(example_agg, dim='a'
         ).pipe(example_agg, dim='b'
               ).pipe(layers_example_with_kw
                     ).pipe(layers_example_named_args).to_features()
Out[37]:
<xarray.MLDataset>
Dimensions:   (layer: 1, space: 840)
Coordinates:
  * space     (space) MultiIndex
  - c         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - d         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - e         (space) int64 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 ...
  - f         (space) int64 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 ...
  * layer     (layer) object 'final'
Data variables:
    features  (space, layer) float64 0.3976 0.6714 0.3157 0.4596 0.9187 ...
In [38]:
dset.chain([(example_agg, dict(dim='a')),
             (example_agg, dict(dim='b')),
             layers_example_with_kw,
             layers_example_named_args,
            ]).to_features()
Out[38]:
<xarray.MLDataset>
Dimensions:   (layer: 1, space: 840)
Coordinates:
  * space     (space) MultiIndex
  - c         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - d         (space) int64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
  - e         (space) int64 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 ...
  - f         (space) int64 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 ...
  * layer     (layer) object 'final'
Data variables:
    features  (space, layer) float64 0.3976 0.6714 0.3157 0.4596 0.9187 ...
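
The step list passed to chain above mixes bare callables with (callable, kwargs) tuples. One way such a list could be interpreted is sketched below; `chain_with_kwargs`, `scale` and `total` are hypothetical names for this illustration, and the real method may accept other step forms as well.

```python
def chain_with_kwargs(data, steps):
    """Apply each step in order; a step is a callable or a (callable, kwargs) pair."""
    for step in steps:
        if isinstance(step, tuple):
            func, kwargs = step
        else:
            func, kwargs = step, {}
        data = func(data, **kwargs)
    return data


def scale(values, factor=1):
    return [v * factor for v in values]


def total(values):
    return sum(values)


chained = chain_with_kwargs(
    [1, 2, 3],
    [(scale, dict(factor=2)), (scale, dict(factor=5)), total],
)

# The same computation written as nested calls, as repeated .pipe calls would do.
nested = total(scale(scale([1, 2, 3], factor=2), factor=5))

print(chained, nested)  # 60 60
```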
In [39]:
flattened = dset.chain([(example_agg, dict(dim='a')),
                         (example_agg, dict(dim='b')),
                         layers_example_with_kw,
                         layers_example_named_args,
                        ]).to_features()
In [40]:
flattened.features.values[0:5, 0] = np.nan
In [41]:
flattened.layer
Out[41]:
<xarray.DataArray 'layer' (layer: 1)>
array(['final'], dtype=object)
Coordinates:
  * layer    (layer) object 'final'
