
Unexpected behaviour when chunking with multiple netcdf files in xarray/dask

I'm working with a set of 468 netcdf files totalling 12 GB. Each file contains a single global snapshot of a geophysical variable, i.e. for each file the data shape is (1, 1801, 3600), corresponding to the dimensions ('time', 'latitude', 'longitude').

My RAM is 8 GB, so I need chunking. I'm creating an xarray dataset using xarray.open_mfdataset, and I have found that passing the chunks parameter when calling xarray.open_mfdataset and rechunking afterwards with the .chunk method have totally different outcomes. A similar issue was reported here without getting any response.

According to the xarray documentation, chunking when calling xarray.open_dataset and rechunking with .chunk should be exactly equivalent...

http://xarray.pydata.org/en/stable/dask.html

[screenshot of the xarray documentation section on dask chunking]

...but it doesn't seem so. I share my examples here.

1) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude), WITH THE TIME DIMENSION UNCHUNKED.

import xarray as xr

# Chunk spatially at open time, then merge the per-file time chunks
# into a single chunk along 'time'.
data1 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200}) \
                         .chunk({'time': -1})
data1.t2m.data

[image: dask array repr of data1.t2m showing the chunk layout]

from dask.diagnostics import ProgressBar

with ProgressBar():
    data1.std('time').compute()

[########################################] | 100% Completed |  5min 44.1s

In this case everything works fine.

2) CHUNKING WITH THE .chunk METHOD ALONG THE SPATIAL DIMENSIONS (longitude, latitude), WITH THE TIME DIMENSION UNCHUNKED.

# Open without the chunks argument (one whole-file chunk per netcdf
# file), then rechunk to the same layout as in 1).
data2 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested') \
                         .chunk({'time': -1, 'longitude': 400, 'latitude': 200})
data2.t2m.data

[image: dask array repr of data2.t2m showing the chunk layout]

As this image shows, the chunking now appears to be exactly the same as in 1). However...

with ProgressBar():
    data2.std('time').compute()

[#####################################   ] | 93% Completed |  1min 50.8s

...the computation of the std could not finish; the jupyter notebook kernel died without any message because the memory limit was exceeded, as I could check by monitoring with htop... This likely implies that the chunking was not actually taking place and the whole unchunked dataset was being loaded into memory.
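
One way to see that the two datasets really do differ, despite the identical-looking reprs, is to compare the dask objects underneath. This is only a diagnostic sketch, assuming data1 and data2 from the examples above are still in scope:

# The reported chunk layouts look the same...
print(data1.t2m.data.chunks)
print(data2.t2m.data.chunks)

# ...but the task graphs are not: in data2 the rechunk layer sits on top
# of one whole-file chunk per netcdf file, so producing any rechunked
# block still has to materialize entire (1, 1801, 3600) arrays first.
print(len(data1.t2m.data.__dask_graph__()))
print(len(data2.t2m.data.__dask_graph__()))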

3) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude), LEAVING THE TIME DIMENSION CHUNKED BY DEFAULT (ONE CHUNK PER FILE).

In theory this case should be much slower than 1), since the computation of std is done along the time dimension and thus many more chunks are generated unnecessarily (42120 chunks now vs 90 chunks in case 1).
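
The chunk counts follow directly from the shapes; a quick sanity check of the arithmetic:

import math

lat_chunks = math.ceil(1801 / 200)    # 10 chunks along latitude
lon_chunks = math.ceil(3600 / 400)    # 9 chunks along longitude
print(lat_chunks * lon_chunks)        # 90 chunks when 'time' is a single chunk
print(468 * lat_chunks * lon_chunks)  # 42120 chunks with one time step per file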

# Chunk spatially at open time; 'time' keeps the default layout of
# one chunk per file.
data3 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200})
data3.t2m.data

[image: dask array repr of data3.t2m showing the chunk layout]

with ProgressBar():
    data3.std('time').compute()

[########################################] | 100% Completed |  5min 51.2s

However there are no memory problems, and the computation takes almost the same time as in case 1). This again suggests that the .chunk method is not working properly.

Does anyone know whether this makes sense, or how to solve this issue? I need to be able to change the chunking depending on the specific computation I have to do.

Thanks

PS: I'm using xarray version 0.15.1

I would need to be able to change the chunking depending on the specific computation I need to do.

Yes, computations will be highly sensitive to chunk structure.

Chunking as early as possible in a computation (ideally when you're reading in the data) is best, because it makes the overall computation simpler.
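
In this case that means passing chunks to open_mfdataset directly, as in example 1). A minimal sketch, reusing the path and chunk sizes from the question:

import xarray as xr

# Specify the chunking when the data is first opened rather than
# rechunking afterwards; the task graph then never contains the
# unchunked whole-file layout.
ds = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                       concat_dim='time', combine='nested',
                       chunks={'latitude': 200, 'longitude': 400})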

In general I recommend larger chunk sizes. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
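
For reference, the spatially chunked layout with 'time' unchunked already gives comfortably sized chunks (assuming float32 data, which is consistent with the 12 GB total reported for 468 × 1801 × 3600 values):

# Size of one (468, 200, 400) float32 chunk:
print(468 * 200 * 400 * 4 / 1e6, 'MB')  # ~150 MB, a reasonable dask chunk size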
