
Binning variable with rolling window on xarray

I have an xarray.Dataset with temperature data and want to calculate the binned temperature for every element of the array using a 7-day rolling window.

I have data in this form:

import xarray as xr

ds = xr.Dataset(
    {'t2m': (['time', 'lat', 'lon'], t2m)},
    coords={
        'lon': lon,
        'lat': lat,
        'time': time,
    }
)
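For a self-contained reproduction, the input arrays could be generated like this (the grid size, date range, and value range here are assumptions for illustration, not from the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 24 daily timesteps on a 10x10 grid
rng = np.random.default_rng(42)
lat = np.linspace(-5.0, 5.0, 10)
lon = np.linspace(-5.0, 5.0, 10)
time = pd.date_range('2020-01-01', periods=24, freq='D')
t2m = 15 + 8 * rng.standard_normal((len(time), len(lat), len(lon)))
```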

Then I use the rolling() method and apply a function on each window array:

r = ds.t2m.chunk({'time': 10}).rolling(time=7)

import numpy as np
from tqdm import tqdm

window_results = []
for label, arr_window in tqdm(r):
    max_temp = arr_window.max(dim=...).values
    min_temp = arr_window.min(dim=...).values
    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)

        buckets = np.digitize(arr_window.isel(time=-1),
                              bins=bins)
        buckets_arr = xr.DataArray(
            buckets,
            dims=['lat', 'lon'],
            coords={'lat': arr_window.lat.values,
                    'lon': arr_window.lon.values},
        )
        buckets_arr = buckets_arr.assign_coords({'time': label})

        window_results.append(buckets_arr)

At the end, I get a list with a window-calculation of binned arrays for each timestep:

ds_concat = xr.concat(window_results, dim='time')
ds_concat

>> <xarray.DataArray (time: 18, lat: 10, lon: 10)>
array([[[1, 2, 2, ..., 2, 2, 3],
        [1, 3, 3, ..., 1, 1, 2],
        [2, 3, 2, ..., 1, 2, 3],
        ...,
        [2, 2, 2, ..., 2, 2, 2],
        [2, 2, 2, ..., 1, 2, 2],
        [2, 2, 3, ..., 2, 3, 2]],
...

This code yields the results I am looking for, but I believe there must be a better alternative that applies this same process using either apply_ufunc or dask. I am also using a dask.distributed.Client, so I am looking for a way to optimize my code to run fast.
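For reference, here is one possible sketch of the apply_ufunc route, using rolling(...).construct to materialize the window as an extra dimension. This is my own illustration under assumptions (a small synthetic dataset stands in for the real one, and unlike the explicit loop it also emits output for the incomplete leading windows):

```python
import numpy as np
import xarray as xr

# Hypothetical sample data standing in for the real dataset
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {'t2m': (['time', 'lat', 'lon'], rng.uniform(0, 30, (10, 3, 4)))},
    coords={'time': np.arange(10), 'lat': np.arange(3), 'lon': np.arange(4)},
)

def bucket_window(w):
    # w: (lat, lon, window) values of a single rolling window
    vmax = np.nanmax(w)
    if np.isnan(vmax):           # window is entirely NaN
        return np.full(w.shape[:2], -1)
    bins = np.arange(np.nanmin(w), vmax, 2)
    # digitize the last day of each window, as in the explicit loop
    return np.digitize(w[..., -1], bins)

# Materialize the rolling window as a 'window' dimension, then vectorize
windows = ds.t2m.rolling(time=7).construct('window')
binned = xr.apply_ufunc(
    bucket_window, windows,
    input_core_dims=[['lat', 'lon', 'window']],
    output_core_dims=[['lat', 'lon']],
    vectorize=True,
)
```

With dask-backed data, `vectorize=True` can be combined with `dask='parallelized'` plus `output_dtypes`, though whether that beats the delayed-loop approach depends on chunking.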

Any help is appreciated.

I finally figured it out. Hope this can help someone with the same problem.

One of the coolest features of dask is dask.delayed. I can rewrite the loop above using a lazy function:

import dask
import numpy as np
import xarray as xr

@dask.delayed
def create_bucket_window(arr, label):

    max_temp = arr.max(dim=...).values
    min_temp = arr.min(dim=...).values

    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)
        buckets = np.digitize(arr.isel(time=-1),
                              bins=bins)
        buckets_arr = xr.DataArray(
            buckets,
            dims=['lat', 'lon'],
            coords={'lat': arr.lat.values,
                    'lon': arr.lon.values},
        )
        buckets_arr = buckets_arr.assign_coords({'time': label})

        return buckets_arr

and then:

window_results = []
for label, arr_window in tqdm(r):
    bucket_array = create_bucket_window(arr=arr_window,
                                        label=label)
    window_results.append(bucket_array)

Once I do this, dask will lazily build these arrays and only evaluate them when needed:

dask.compute(*window_results)

And there you will have a collection of results!
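One caveat worth noting (my addition, not part of the original answer): create_bucket_window returns None when a window's max is NaN, so the tuple returned by dask.compute can contain None entries that should be filtered out before concatenating with xr.concat. The pattern in miniature, with a hypothetical stand-in function:

```python
import dask

@dask.delayed
def maybe_result(x):
    # Returns None for "bad" inputs, mirroring how
    # create_bucket_window returns None for all-NaN windows
    return x * 2 if x >= 0 else None

lazy = [maybe_result(x) for x in [-1, 0, 1, 2]]
# dask.compute returns a tuple; drop the None entries before concat
computed = [r for r in dask.compute(*lazy) if r is not None]
# computed == [0, 2, 4]
```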
