
Unexpected behaviour when chunking with multiple netcdf files in xarray/dask

I'm working with a set of 468 netCDF files totalling 12 GB. Each file holds a single global snapshot of a geophysical variable, i.e. each file's data shape is (1, 1801, 3600), corresponding to the dimensions ('time', 'latitude', 'longitude').

My RAM is 8 GB, so I need chunking. I'm creating an xarray dataset with xarray.open_mfdataset, and I have found that passing the chunks parameter to xarray.open_mfdataset and rechunking afterwards with the .chunk method have totally different outcomes. A similar issue was reported here without getting any response.

According to the xarray documentation, chunking when calling xarray.open_dataset and rechunking later with .chunk should be exactly equivalent...

http://xarray.pydata.org/en/stable/dask.html

...but it doesn't seem so. I share my examples here.
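For arrays that fit in memory, the two routes do produce identical final chunk layouts, which can be checked with a small synthetic stand-in (the shapes below are made up for illustration, not the real files):

```python
import numpy as np
import dask.array as da

x = np.zeros((8, 10, 20))                      # stand-in for (time, lat, lon)
a = da.from_array(x, chunks=(8, 5, 10))        # chunk at creation time
b = da.from_array(x, chunks=(1, 10, 20)).rechunk((8, 5, 10))  # rechunk later

# The final chunk layouts are identical either way...
print(a.chunks == b.chunks)  # True
```

...so the difference observed below must lie in how the computation is executed, not in the declared layout.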

1) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude) HAVING THE TIME DIMENSION UNCHUNKED.

import xarray as xr
from dask.diagnostics import ProgressBar

data1 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks = {'longitude':400, 'latitude':200}) \
                         .chunk({'time':-1})
data1.t2m.data


with ProgressBar():
    data1.std('time').compute()

[########################################] | 100% Completed |  5min 44.1s

In this case everything works fine.

2) CHUNKING WITH METHOD .chunk ALONG THE SPATIAL DIMENSIONS (longitude, latitude) HAVING THE TIME DIMENSION UNCHUNKED.

data2 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested') \
                         .chunk({'time': -1, 'longitude':400, 'latitude':200})
data2.t2m.data


The dask array repr shows that the chunking now appears to be exactly the same as in 1). However...

with ProgressBar():
    data2.std('time').compute()

[#####################################   ] | 93% Completed |  1min 50.8s

...the computation of the std could not finish: the Jupyter notebook kernel died without any message because the memory limit was exceeded, as I could check by monitoring with htop. This suggests that the chunking was not actually taking effect and the entire unchunked dataset was being loaded into memory.
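One way to see why case 2 can behave so differently is to compare task-graph sizes. Rechunking an array that starts as one chunk per file keeps all the per-file tasks and adds merge tasks on top, whereas declaring the chunks at creation time does not. A sketch with plain dask.array, using a shrunken stand-in for the real 468-file dataset (the shapes are illustrative assumptions):

```python
import dask.array as da

# Stand-in: 468 "files" of shape (1, 18, 36) instead of (1, 1801, 3600).
per_file = da.zeros((468, 18, 36), chunks=(1, 18, 36))   # like open_mfdataset's default
rechunked = per_file.rechunk((468, 6, 12))               # like calling .chunk afterwards
direct = da.zeros((468, 18, 36), chunks=(468, 6, 12))    # like passing chunks= up front

# The rechunked graph carries the per-file tasks plus the merge tasks,
# so it is far larger than the directly chunked one.
print(len(dict(rechunked.__dask_graph__())), len(dict(direct.__dask_graph__())))
```

A much larger graph means dask may need to materialize many source chunks at once to assemble each merged chunk, which is where the memory pressure comes from.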

3) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude) AND LEAVING THE TIME DIMENSION CHUNKED BY DEFAULT (ONE CHUNK PER FILE).

In theory this case should be much slower than 1), since the computation of the std is done along the time dimension and thus many more chunks are generated unnecessarily (42120 chunks now vs 90 chunks in 1)).
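The chunk counts follow directly from the grid and chunk sizes (simple arithmetic on the numbers stated in the question):

```python
import math

# 1801 x 3600 grid, chunked 200 x 400 in space, 468 files in time.
spatial = math.ceil(1801 / 200) * math.ceil(3600 / 400)  # 10 * 9 = 90
print(spatial)        # 90 chunks when time is a single chunk (case 1)
print(468 * spatial)  # 42120 chunks with one time step per file (case 3)
```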

data3 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks = {'longitude':400, 'latitude':200})
data3.t2m.data


with ProgressBar():
    data3.std('time').compute()

[########################################] | 100% Completed |  5min 51.2s

However, there are no memory problems, and the computation takes almost the same time as in case 1). This again suggests that the .chunk method is not working properly.

Does anyone know whether this makes sense, or how to solve this issue? I need to be able to change the chunking depending on the specific computation I have to do.

Thanks

PS: I'm using xarray version 0.15.1

I would need to be able to change the chunking depending on the specific computation I need to do.

Yes, computations will be highly sensitive to chunk structure.

Chunking as early as possible in a computation (ideally when you read in the data) is best, because it keeps the overall task graph simpler.

In general I recommend larger chunk sizes. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
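As a rough sanity check on chunk sizes for the dataset in the question (assuming 4-byte float32 values, which is an assumption about the files, not stated in the question): one (468, 200, 400) chunk comes to about 150 MB, comfortably within the commonly recommended range of roughly 100 MB to 1 GB per chunk.

```python
# Assuming float32 (4 bytes) per value:
# 468 time steps x 200 latitudes x 400 longitudes per chunk.
bytes_per_chunk = 4 * 468 * 200 * 400
print(round(bytes_per_chunk / 1e6))  # ~150 MB per chunk
```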
