Applying a function along an axis of a dask array
I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array
below) typically have a shape of (6000, 31, 189, 192) and a size of ~25 GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy).
I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set
chunksize=(6000, 1, 1, 1)
so I have a separate chunk for each grid point.
This is my function for getting the coefficients of the cubic polynomial (the time_axis
axis values are a global 1D numpy array defined elsewhere):
def my_polyfit(data):
    return numpy.polyfit(data.squeeze(), time_axis, 3)
(So in this case, numpy.polyfit
returns an array of length 4.)
and this is the command I thought I'd need to apply it to each chunk:
dask_array.map_blocks(my_polyfit, chunks=(4, 1, 1, 1), drop_axis=0, new_axis=0).compute()
whereby the time axis is now gone (hence drop_axis=0
) and there's a new coefficient axis of length 4 in its place (hence new_axis=0
).
When I run this command I get IndexError: tuple index out of range
, so I'm wondering where/how I've misunderstood the use of map_blocks
?
I suspect that your experience will be smoother if your function returns an array with the same number of dimensions as the one it consumes. E.g. you might consider defining your function as follows:
def my_polyfit(data):
    return np.polyfit(data.squeeze(), ...)[:, None, None, None]
Then you can probably ignore the new_axis
and drop_axis
bits.
Performance-wise you might also want to consider using a larger chunksize. At 6000 numbers per chunk you have over a million chunks, which means you'll probably spend more time on scheduling than on actual computation. Generally I shoot for chunks that are a few megabytes in size. Of course, increasing the chunksize would cause your mapped function to become more complex.
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b.squeeze(), np.arange(5), 3)[:, None, None, None]
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: x.map_blocks(f, chunks=(4, 1, 1, 1)).compute()
Out[5]:
array([[[[ -1.29058580e+02,   2.21410738e+02,   1.00721521e+01],
         [ -2.22469851e+02,  -9.14889627e+01,  -2.86405832e+02],
         [  1.40415805e+02,   3.58726232e+02,   6.47166710e+02]],
...
Kind of late to the party, but I figured this could use an alternative answer based on new features in Dask. In particular, we added
apply_along_axis
, which behaves basically like NumPy's apply_along_axis
, except for Dask Arrays instead. This results in somewhat simpler syntax. It also avoids the need to rechunk your data before applying your custom function to each 1-D piece, and it makes no real requirements of your initial chunking, which it tries to preserve in the end result (except for the axis that is either reduced or replaced).
In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: def f(b):
   ...:     return np.polyfit(b, np.arange(len(b)), 3)
   ...:
In [4]: x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))
In [5]: da.apply_along_axis(f, 0, x).compute()
Out[5]:
array([[[[  2.13570599e+02,   2.28924503e+00,   6.16369231e+01],
         [  4.32000311e+00,   7.01462518e+01,  -1.62215514e+02],
         [  2.89466687e+02,  -1.35522215e+02,   2.86643721e+02]],
...
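One further refinement worth knowing: da.apply_along_axis normally infers the output metadata by calling your function once on a dummy 1-D array, and you can skip that trial call by passing the dtype and shape keywords explicitly (these parameters exist on dask's apply_along_axis; the sketch below assumes a reasonably recent dask version, and calls np.polyfit in the conventional x-then-y order):

```python
import numpy as np
import dask.array as da

def f(b):
    # Cubic fit with the index as the x variable.
    return np.polyfit(np.arange(len(b)), b, 3)

x = da.random.random((5, 3, 3, 3), chunks=(5, 1, 1, 1))

# dtype/shape describe f's output for one 1-D slice (4 coefficients),
# so dask can build the graph without probing f on dummy data.
y = da.apply_along_axis(f, 0, x, dtype=x.dtype, shape=(4,))
print(y.shape)  # (4, 3, 3, 3)
```

This matters mostly when the trial call would be expensive or when your function misbehaves on all-ones dummy input.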