Preparing for ML Workflows

- Given an `xarray.Dataset` of rasters or N-D arrays, reshape them into a feature matrix for ML, retaining coordinate system metadata, and call a sequence of transformations on them.
- `MLDataset` from xarray_filters is a subclass of `xarray.Dataset` with methods for reshaping the `Dataset`'s `DataArray`s from time series, rasters, or N-D arrays into a single 2-D `DataArray` for input to statistical models.
- New methods:
  - `MLDataset.to_features`
  - `MLDataset.from_features`
  - `MLDataset.chain`
import os
import numpy as np
import xarray as xr
from xarray_filters import *
The following cell imports a function to create example `xarray_filters.MLDataset` objects.
from xarray_filters.tests.test_data import new_test_dataset
Example collection of 4-D weather arrays

Arrays with dims `(x, y, z, t)` for several state variables:
X = new_test_dataset(('pressure', 'temperature', 'wind_x', 'wind_y'))
X
Methods of `MLDataset` that are not methods of `xarray.Dataset`:
set(dir(MLDataset)) - set(dir(xr.Dataset))
Aggregating first

One option is to aggregate along one or more dims before converting to a single feature matrix:
X_means_raster = X.mean(dim=('z', 't'))
X_means_raster
xarray_filters.MLDataset.to_features

`to_features()`:

- Flattens each 4-D array of `X` to a column
- Concatenates the columns into a single `DataArray`
f = X.to_features()
f
The coordinates of the 4-D arrays are now in a `pandas.MultiIndex`.
f.space
The columns of the `features` `DataArray` are named by the `layer` that was flattened from 4-D to a 1-D column. Usage of `OrderedDict` throughout `MLDataset` internals ensures that the `layers` (`DataArray`s) always iterate in the same column order.
f.layer
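The flatten-and-concatenate step behind `to_features` can be sketched in plain numpy (a minimal model of the idea, not the library's actual implementation): each layer is raveled to a 1-D column, and the columns are stacked in the layers' insertion order.

```python
import numpy as np
from collections import OrderedDict

# Sketch: ravel each layer to a 1-D column, then stack the columns
# in the layers' insertion order, giving one row per grid point.
layers = OrderedDict([
    ('pressure', np.arange(8).reshape(2, 2, 2)),
    ('temperature', 10.0 * np.arange(8).reshape(2, 2, 2)),
])
features = np.column_stack([arr.ravel() for arr in layers.values()])
print(features.shape)  # (8, 2): one row per point, one column per layer
```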
Showing the first few `(x, y, z, t)` coordinates of the `pandas.MultiIndex` `space`:
f.space.indexes['space'].tolist()[:4]
f.space.indexes['space'].names
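The `space` row index can be approximated with `pandas.MultiIndex.from_product`, which takes the Cartesian product of each dimension's coordinate values in dimension order (the sizes below are illustrative, not taken from `X`):

```python
import numpy as np
import pandas as pd

# Sketch of the 'space' row index: the Cartesian product of each
# dimension's coordinates, in dim order (sizes here are made up).
space = pd.MultiIndex.from_product(
    [np.arange(2), np.arange(2), np.arange(2), np.arange(3)],
    names=('x', 'y', 'z', 't'))
print(list(space.names))  # ['x', 'y', 'z', 't']
print(len(space))         # 2 * 2 * 2 * 3 == 24
```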
f.features.values
It is also possible to transpose the `layers` before calling `.ravel()` on each one, via the `trans_dims` keyword to `to_features()`:
example2 = X.mean(dim='x').to_features(trans_dims=('t', 'z', 'y'))
example2
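The effect of transposing before raveling comes down to which dimension varies fastest in each resulting 1-D column; a plain numpy illustration of that ordering change (not the library's code):

```python
import numpy as np

# Transposing before ravel changes which dimension varies fastest
# in the resulting 1-D column.
arr = np.arange(6).reshape(2, 3)       # think of dims ('z', 'y')
default_order = arr.ravel()            # last dim ('y') varies fastest
swapped = arr.transpose(1, 0).ravel()  # now 'z' varies fastest
print(default_order)  # [0 1 2 3 4 5]
print(swapped)        # [0 3 1 4 2 5]
```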
data_vars_func decorator

The `data_vars_func` decorator allows writing a function that takes named `layers` as keyword or positional arguments. In the example below, it is assumed that the decorated `magnitude` function will be passed to `X.chain` in situations where `X` has `layers` named `wind_x` and `wind_y`. All other `data_vars` keys/values are passed as `other_data_vars` keyword arguments.
@data_vars_func
def magnitude(wind_x, wind_y, **other_data_vars):
    a2 = wind_x ** 2
    b2 = wind_y ** 2
    mag = (a2 + b2) ** 0.5
    return dict(magnitude=mag)
X.chain(magnitude, layers=['wind_x', 'wind_y']).to_features(features_layer='magnitude')
for_each_array decorator

`for_each_array` automates calling a function that takes a `DataArray` argument and returns a `DataArray`, applying it to each `DataArray` (`layer`) in an `MLDataset`:
@for_each_array
def plus_one(arr, **kw):
    return arr + 1
@for_each_array
def minus_one(arr, **kw):
    return arr - 1
plus = X.chain(plus_one)
minus = X.chain(minus_one)
assert np.all(plus.wind_x - minus.wind_x == 2.)
assert np.all(plus.temperature - minus.temperature == 2.)
@for_each_array
def transform_example(arr, **kw):
    up = arr.quantile(0.75, dim='z')
    low = arr.quantile(0.25, dim='z')
    median = arr.quantile(0.5, dim='z')
    return (arr - median) / (up - low)
X.chain(transform_example)
@for_each_array
def agg_example(arr, **kw):
    return arr.mean(dim='t').quantile(0.25, dim='z')
aggregated = X.chain((transform_example, agg_example))
aggregated
With `data_vars_func`-decorated functions, anything `dict`-like, an `MLDataset`, or an `xarray.Dataset` may be returned, and it will be converted to an `MLDataset`:
from collections import OrderedDict
@data_vars_func
def f(wind_x, wind_y, temperature, pressure):
    mag = (wind_x ** 2 + wind_y ** 2) ** 0.5
    return OrderedDict([('mag', mag), ('temperature', temperature), ('pressure', pressure)])
f(X)
feat = f(X).to_features()
feat
feat.features
feat.features.values
xarray_filters.MLDataset.chain

`.chain` can be called on an `MLDataset` to run callables in sequence, passing an `MLDataset` between steps.
@for_each_array
def agg_x(arr, **kw):
    return arr.mean(dim='x')
@for_each_array
def agg_y(arr, **kw):
    return arr.mean(dim='y')
@for_each_array
def agg_z(arr, **kw):
    return arr.mean(dim='z')
time_series = X.chain((agg_x, agg_y, agg_z))
time_series
time_series.to_features().features
Creating some synthetic rasters in an `MLDataset` that are similar to LANDSAT imagery with 8 spectral layers:
layers = ['layer_{}'.format(idx) for idx in range(1, 9)]
shape = (200, 200)
rand_np_arr = lambda: np.random.normal(0, 1, shape)
coords = [('x', np.arange(shape[0])), ('y', np.arange(shape[1]))]
rand_data_arr = lambda: xr.DataArray(rand_np_arr(), coords=coords, dims=('x', 'y'))
data_vars = OrderedDict([(layer, rand_data_arr()) for layer in layers])
dset = MLDataset(data_vars)
dset
Examples of chaining callables that use `for_each_array` and `data_vars_func` as decorators, where the example functions also show the variety of return types allowed in functions decorated by `data_vars_func`.

Note the `keep_arrays=True` keyword argument in the function prototypes: it means the original `layers` passed into the decorated functions will be part of the `MLDataset` outputs, even if the decorated functions do not return them.
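Conceptually, `keep_arrays=True` behaves like merging the returned layers over the input layers rather than replacing them; a rough dict-based model of that behavior (not the library's actual code):

```python
# Rough model of keep_arrays=True using plain dicts: the returned layer
# is merged over the original layers instead of replacing the whole set.
original = {'layer_4': 4.0, 'layer_5': 5.0}
returned = {'ndvi': (5.0 - 4.0) / (5.0 + 4.0)}  # what the function returns
merged = {**original, **returned}               # inputs survive in the output
print(sorted(merged))  # ['layer_4', 'layer_5', 'ndvi']
```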
from functools import partial
@for_each_array
def standardize(arr, dim=None, **kw):
    mean = arr.mean(dim=dim)
    std = arr.std(dim=dim)
    return (arr - mean) / std
@data_vars_func
def ndvi(layer_5, layer_4, keep_arrays=True):
    return OrderedDict([('ndvi', (layer_5 - layer_4) / (layer_5 + layer_4))])
@data_vars_func
def ndwi(layer_3, layer_5, keep_arrays=True, **kw):
    return {'ndwi': (layer_3 - layer_5) / (layer_3 + layer_5)}
@data_vars_func
def mndwi_36(layer_3, layer_6, keep_arrays=True):
    return xr.Dataset({'mndwi_36': (layer_3 - layer_6) / (layer_3 + layer_6)})
@data_vars_func
def mndwi_37(layer_3, layer_7, keep_arrays=True):
    return MLDataset(OrderedDict([('mndwi_37', (layer_3 - layer_7) / (layer_3 + layer_7))]))
normed_diffs = dset.chain((ndvi, ndwi, mndwi_36, mndwi_37))
standardized = dset.chain(partial(standardize, dim='x'))
normed_diffs
standardized
Merging two `MLDataset`s and converting the merged output to a 2-D features `DataArray`:
catted = normed_diffs.merge(standardized, overwrite_vars=standardized.data_vars.keys())
catted = catted.to_features()
catted.features
catted.layer
catted.from_features()
The following synthetic data example shows that the logic above can work for any number of dimensions, e.g. the 6-D `DataArray`s below:
shp = (2, 3, 4, 5, 6, 7)
dims = ('a', 'b', 'c', 'd', 'e', 'f')
coords = OrderedDict([(dim, np.arange(s)) for s, dim in zip(shp, dims)])
dset = MLDataset(OrderedDict([('layer_{}'.format(idx),
xr.DataArray(np.random.normal(0, 10, shp),
coords=coords,
dims=dims))
for idx in range(6)]))
dset
dset.layer_0.shape
With 6-D `DataArray`s, calling `to_features` creates a `pandas.MultiIndex` with 6 components:
dset.to_features()
The following cells demonstrate that `MLDataset.chain` is the same as calling `.pipe` several times in sequence.
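That equivalence can be modeled in a few lines of plain Python: each step is either a callable or a `(callable, kwargs)` pair, applied left to right just like repeated `.pipe` calls. This is a sketch of the semantics, not the library's implementation.

```python
# Sketch of chain semantics: apply steps in sequence, unpacking optional
# kwargs for each step, exactly like a series of .pipe calls.
def chain_sketch(value, steps):
    for step in steps:
        func, kwargs = step if isinstance(step, tuple) else (step, {})
        value = func(value, **kwargs)
    return value

add = lambda v, inc=0: v + inc
double = lambda v: v * 2
print(chain_sketch(1, [(add, {'inc': 10}), double]))  # (1 + 10) * 2 == 22
```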
@for_each_array
def example_agg(arr, dim=None):
    return arr.std(dim=dim)
@data_vars_func
def layers_example_with_kw(**kw):
    new = OrderedDict([('new_layer_100', kw['layer_3'] + kw['layer_4'])])
    new.update(kw)
    return MLDataset(new)
@data_vars_func
def layers_example_named_args(layer_1, layer_2, new_layer_100):
    return MLDataset(OrderedDict([('final', new_layer_100 / (layer_1 + layer_2))]))
dset.pipe(example_agg, dim='a'
    ).pipe(example_agg, dim='b'
    ).pipe(layers_example_with_kw
    ).pipe(layers_example_named_args).to_features()
dset.chain([(example_agg, dict(dim='a')),
(example_agg, dict(dim='b')),
layers_example_with_kw,
layers_example_named_args,
]).to_features()
flattened = dset.chain([(example_agg, dict(dim='a')),
(example_agg, dict(dim='b')),
layers_example_with_kw,
layers_example_named_args,
]).to_features()
flattened.features.values[0:5, 0] = np.nan
flattened.layer