Binning variable with rolling window on xarray
I have an xarray.Dataset with temperature data and want to calculate the binned temperature for every element of the array using a 7-day rolling window.
I have data in this form:
import xarray as xr

# t2m, lon, lat and time are pre-existing NumPy arrays
ds = xr.Dataset(
    {'t2m': (['time', 'lat', 'lon'], t2m)},
    coords={
        'lon': lon,
        'lat': lat,
        'time': time,
    }
)
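(For anyone wanting to reproduce this: the `t2m`, `lat`, `lon` and `time` arrays are assumed to already exist. Hypothetical stand-ins with made-up shapes and values could look like this:)

```python
import numpy as np
import pandas as pd

# made-up sample dimensions: 24 daily steps on a 10x10 grid
time = pd.date_range('2020-01-01', periods=24, freq='D')
lat = np.linspace(-5.0, 5.0, 10)
lon = np.linspace(30.0, 40.0, 10)

# random temperatures in Kelvin, shape (time, lat, lon)
rng = np.random.default_rng(0)
t2m = 270.0 + 15.0 * rng.random((len(time), len(lat), len(lon)))
```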
Then I use the rolling() method and apply a function on each window array:
r = (ds.t2m
     .chunk({'time': 10})
     .rolling(time=7))
import numpy as np
from tqdm import tqdm

window_results = []
for label, arr_window in tqdm(r):
    max_temp = arr_window.max(dim=...).values
    min_temp = arr_window.min(dim=...).values
    if not np.isnan(max_temp):
        # bin edges every 2 degrees between the window min and max
        bins = np.arange(min_temp, max_temp, 2)
        # assign each grid cell of the window's last timestep to a bin
        buckets = np.digitize(arr_window.isel(time=-1), bins=bins)
        buckets_arr = xr.DataArray(buckets,
                                   dims=['lat', 'lon'],
                                   coords={'lat': arr_window.lat.values,
                                           'lon': arr_window.lon.values})
        buckets_arr = buckets_arr.assign_coords({'time': label})
        window_results.append(buckets_arr)
At the end, I get a list with one binned array per timestep:
ds_concat = xr.concat(window_results, dim='time')
ds_concat
>> <xarray.DataArray (time: 18, lat: 10, lon: 10)>
array([[[1, 2, 2, ..., 2, 2, 3],
        [1, 3, 3, ..., 1, 1, 2],
        [2, 3, 2, ..., 1, 2, 3],
        ...,
        [2, 2, 2, ..., 2, 2, 2],
        [2, 2, 2, ..., 1, 2, 2],
        [2, 2, 3, ..., 2, 3, 2]],
       ...
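As a side note, the per-window binning step above boils down to np.digitize against edges spaced 2 degrees apart. A minimal self-contained sketch of just that step, with made-up temperature values:

```python
import numpy as np

# hypothetical temperatures for one timestep of a window
temps = np.array([[270.0, 273.5],
                  [276.2, 279.9]])

# bin edges every 2 degrees between the window min and max
bins = np.arange(temps.min(), temps.max(), 2)  # [270. 272. 274. 276. 278.]

# each cell gets the 1-based index of the bin it falls into
buckets = np.digitize(temps, bins=bins)
print(buckets)  # [[1 2]
                #  [4 5]]
```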
This code yields the results I am looking for, but I believe there must be a better alternative to apply this same process using either apply_ufunc or dask. I am also using a dask.distributed.Client, so I am looking for a way to optimize my code to run fast.
Any help is appreciated.
I finally figured it out. Hope this can help someone with the same problem.
One of the coolest features of dask.distributed is dask.delayed. I can rewrite the loop above using a lazy function:
import dask
import numpy as np
import xarray as xr

@dask.delayed
def create_bucket_window(arr, label):
    max_temp = arr.max(dim=...).values
    min_temp = arr.min(dim=...).values
    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)
        buckets = np.digitize(arr.isel(time=-1), bins=bins)
        buckets_arr = xr.DataArray(buckets,
                                   dims=['lat', 'lon'],
                                   coords={'lat': arr.lat.values,
                                           'lon': arr.lon.values})
        buckets_arr = buckets_arr.assign_coords({'time': label})
        return buckets_arr
and then:
window_results = []
for label, arr_window in tqdm(r):
    bucket_array = create_bucket_window(arr=arr_window, label=label)
    window_results.append(bucket_array)
Once I do this, dask will lazily build these arrays and only evaluate them when needed:
dask.compute(*window_results)
And there you will have a collection of results!
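For readers new to dask.delayed, the underlying pattern is simple: calls to a decorated function return Delayed placeholders immediately, and no work runs until dask.compute is called. A minimal sketch, independent of the xarray code above:

```python
import dask

@dask.delayed
def square(x):
    # not executed when called -- only when computed
    return x * x

# building the task list is instant; no squaring happens yet
tasks = [square(i) for i in range(4)]

# evaluation happens here, potentially in parallel
results = dask.compute(*tasks)
print(results)  # (0, 1, 4, 9)
```

With a dask.distributed.Client active, compute() will ship these tasks to the cluster's workers instead of running them locally.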