xarray.open_mfdataset() doesn't work if dask.distributed client has been created

I have a somewhat odd problem that I'd appreciate some input on. I'm running a notebook on the AWS Pangeo Cloud and opening some GOES-16 satellite data on S3 (via s3fs) with xr.open_mfdataset.

This works great if I don't use dask at all: the dataset is constructed in a couple of minutes.

But if I create a dask.distributed client before I open the files, open_mfdataset hangs, seemingly forever.

I made some simple notebooks that can be explored here, as well as a Binder link that can be used to run them. Any input would be appreciated!

https://github.com/lsterzinger/pangeo-cloud-L2-satellite/tree/main/dask_troubleshooting

Would the following achieve what you are after?

ds = xr.open_mfdataset(file_objs, combine='nested', concat_dim='t', data_vars='minimal', coords='minimal', compat='override')

Note that with these settings the non-dask version loads in about 35 seconds, while the dask version seems to take on the order of 90 seconds. I haven't worked with this data, so I don't know whether it applies here, but the scaling advantages may kick in for a larger number of files (right now there are 24).

This is based on the guidance in the docs:

Commonly, a few of these variables need to be concatenated along a dimension (say "time"), while the rest are equal across the datasets (ignoring floating point differences).

This command concatenates variables along the "time" dimension, but only those that already contain the "time" dimension (data_vars='minimal', coords='minimal'). Variables that lack the "time" dimension are taken from the first dataset (compat='override').
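
The behaviour described in the quoted docs can be checked on small in-memory datasets, without S3 or dask. The following is only a minimal sketch: it calls xr.concat directly with the same data_vars, coords and compat settings rather than open_mfdataset, and the names make_scan, rad and scale_factor are made up for illustration and are not taken from the GOES-16 files.

```python
import numpy as np
import xarray as xr


def make_scan(i):
    """Build one tiny dataset standing in for a single file (hypothetical layout)."""
    return xr.Dataset(
        data_vars={
            # Has the "t" dimension, so it is concatenated along "t".
            "rad": (("t", "y", "x"), np.full((1, 2, 3), float(i))),
            # Lacks "t" and differs slightly between datasets on purpose.
            "scale_factor": ((), 0.5 + i * 1e-12),
        },
        coords={"t": [i]},
    )


scans = [make_scan(i) for i in range(3)]

combined = xr.concat(
    scans,
    dim="t",
    data_vars="minimal",  # only concatenate data variables that already have "t"
    coords="minimal",     # same rule for coordinates
    compat="override",    # skip equality checks; take other variables from the first dataset
)

print(combined["rad"].dims)
print(combined["scale_factor"].dims)
print(float(combined["scale_factor"]))
```

Here rad ends up with three steps along t, while scale_factor stays a scalar copied from the first dataset even though the later datasets hold slightly different values. Skipping those per-variable comparisons is presumably where the time saving comes from when many files are opened.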
